Methods for determining the cause of somatic mutagenesis

ABSTRACT

A method can determine the likelihood that targeted somatic mutagenesis of a nucleic acid molecule by a mutagenic agent has occurred. The method includes analyzing the sequence of the nucleic acid molecule to determine, for a number of mutations of a mutation type at one or more motifs recognized or targeted by the mutagenic agent, the codon context of those mutations to thereby identify the location of a mutation and mutation type for each of the mutated codons in the nucleic acid molecule. The codon context of an individual mutation can be determined by determining at which of the three positions of a corresponding mutated codon the individual mutation occurs. The mutagenic agent can be one of aflatoxin, activation-induced cytidine deaminase (AID), or an apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like (APOBEC) cytidine deaminase.

PRIORITY AND CROSS REFERENCE TO RELATED APPLICATIONS

This application is a Continuation Application of U.S. application Ser. No. 14/440,837, filed May 5, 2015, which is a U.S. National Phase Application under 35 U.S.C. § 371 of International Application No. PCT/AU2013/001275, filed Nov. 5, 2013, designating the U.S. and published in English as WO 2014/066955 A1 on May 8, 2014, which claims priority to Australian Provisional Application No. 2012904826, entitled “Method for the diagnosis of disease associated with somatic mutagenesis”, filed on 5 Nov. 2012; Australian Provisional Application No. 2012904940, entitled “Database system and method for the diagnosis of disease associated with somatic mutagenesis”, filed on 13 Nov. 2012; and Australian Provisional Application No. 2013901253, entitled “Method for the diagnosis of disease associated with mutagenesis using small numbers of somatic mutations”, filed on 12 Apr. 2013. The subject matter of Australian Provisional Application Nos. 2012904826, 2012904940 and 2013901253 is incorporated herein by reference in its entirety. Any and all applications for which a foreign or a domestic priority is claimed is/are identified in the Application Data Sheet filed herewith and is/are hereby incorporated by reference in their entirety under 37 C.F.R. § 1.57.

FIELD

This invention relates generally to methods for determining the likelihood that targeted somatic mutagenesis of a nucleic acid molecule by a mutagenic agent has occurred, and the likelihood that a mutagenic agent is a cause of targeted somatic mutagenesis of a nucleic acid molecule. The invention further relates to methods for diagnosing cancer in a subject and/or determining the likelihood that a subject has or will develop cancer, and methods for treating subjects diagnosed with cancer or determined to be likely to have or to develop cancer. In further aspects, the invention relates to methods for identifying motifs in nucleic acid molecules that are recognized or targeted by mutagenic agents.

BACKGROUND

The progression of normal cells to cancer cells can be influenced by a variety of factors, including changes in the immune system, hormonal status, gene expression and signalling between tissues. A particularly important factor in cancer progression is somatic mutation, which plays a role in cancers of most, if not all, tissue types.

The accumulation of somatic mutations in various genes appears directly related to cancer progression. This has been demonstrated using various animal models in which an increase in somatic mutagenesis resulting from, for example, impaired DNA polymerase proofreading or DNA repair, was associated with accelerated tumor progression (see e.g. Venkatesan et al. (2007). Mol. Cell. Biol. 27: 7669-7682; and Albertson (2009) Proc. Natl. Acad. Sci. U.S.A. 106, 17101-17104). Increased somatic mutagenesis of various genes has also been associated with a variety of cancers. For example, somatic mutations in the TP53 gene are one of the most frequent alterations in human cancers. Somatic TP53 mutations occur in almost every type of cancer at rates from 38%-50% in ovarian, esophageal, colorectal, head and neck, larynx, and lung cancers to about 5% in primary leukemia, sarcoma, testicular cancer, malignant melanoma, and cervical cancer, and advanced stage or aggressive cancer subtypes (such as triple negative or HER2-amplified breast cancers) are associated with an increased frequency of somatic mutations in TP53 (reviewed in Olivier et al. (2010) Cold Spring Harb Perspect Biol 2:a001008). Other genes associated with cancer that accumulate somatic mutations include, for example, BRAF, HRAS, KRAS2 and NRAS, although over 25000 genes are now included in COSMIC, the online database of somatically acquired mutations found in human cancer.

Somatic mutagenesis can be caused by environmental factors, such as cigarette smoke, UV light and radiation, and/or biological factors or processes, such as chromosome translocation, DNA mis-repair or non-repair, and enzyme-initiated somatic hypermutation (SHM). Determining the cause and extent of somatic mutagenesis in cells can not only assist in diagnosing conditions associated with somatic mutagenesis or predicting the risk of developing such conditions, but can also assist in developing the most appropriate treatment or prevention protocols. Thus, there is a need for accurate methods for determining the presence of somatic mutagenesis and identifying which mutagenic agent or agents are responsible for somatic mutagenesis in a subject.

SUMMARY

The present invention is predicated in part on the determination that there is a bias towards somatic mutagenesis by various mutagenic agents at motifs when the motifs are present in a particular codon context within the nucleic acid molecule. Thus, while it was previously understood that some mutagenic agents target motifs, as described herein mutagenesis at these motifs occurs predominantly when the motifs are within a particular codon context, a process termed herein targeted somatic mutagenesis. By identifying this additional requirement for the codon context of the motif, the present inventors have developed methods for determining the likelihood that this type of targeted somatic mutagenesis has occurred, and the likelihood that one or more particular mutagenic agents are the cause of the targeted somatic mutagenesis. Generalized methods for identifying motifs targeted by mutagenic agents through assessing instances of targeted somatic mutation have also been developed and are described herein.

As the accumulation of somatic mutations is associated with the development and progression of cancer, methods for diagnosing cancer in a subject and determining the likelihood that a subject has or will develop cancer have also been developed. By identifying the causative mutagenic agent and/or diagnosing the cancer or likelihood of developing cancer, appropriate and specific treatment protocols can be developed to inhibit or reduce the activity of the mutagenic agent, and/or treat or prevent the cancer.

Thus, in one aspect, the present invention is directed to methods for detecting or determining the likelihood that targeted somatic mutagenesis of a nucleic acid molecule by a mutagenic agent has occurred, comprising analyzing the sequence of the nucleic acid molecule to determine the codon context of mutations of a mutation type at one or more motifs, wherein a determination that targeted somatic mutagenesis has been detected or is likely to have occurred is made when there is a higher than expected percentage or number of the mutations at one position in codons in the nucleic acid molecule.

Generally, the expected percentage or number of mutations is calculated by assuming that mutations occur independently of codon context. In some embodiments, the expected percentage of mutations is approximately 11% or 17%, and/or the expected number of mutations is approximately 1 of every 9 mutations or 1 of every 6 mutations. In some examples, the percentage of mutations is observed to be at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 80%, 85%, 90%, 95% or more.
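The comparison of an observed count against the expected fraction can be illustrated with a short sketch. The text does not specify a statistical test at this point, so the one-sided binomial tail probability used here is an assumption chosen for illustration, as are the function names and the 0.05 significance level. The default expected fraction of 1/9 is one of the two values given above.

```python
from math import comb


def upper_tail_probability(observed: int, total: int, expected_fraction: float) -> float:
    """Return P(X >= observed) for X ~ Binomial(total, expected_fraction)."""
    return sum(
        comb(total, k) * expected_fraction**k * (1 - expected_fraction) ** (total - k)
        for k in range(observed, total + 1)
    )


def exceeds_expectation(observed: int, total: int,
                        expected_fraction: float = 1 / 9,
                        alpha: float = 0.05) -> bool:
    """Hypothetical decision rule: the observed fraction is above the expected
    fraction AND the binomial tail probability is below alpha (assumed cut-off)."""
    if total <= 0:
        raise ValueError("total must be a positive number of mutations")
    observed_fraction = observed / total
    p_value = upper_tail_probability(observed, total, expected_fraction)
    return observed_fraction > expected_fraction and p_value < alpha
```

For example, `exceeds_expectation(18, 36)` asks whether 18 of 36 mutations (50%) at one codon position is more than would be expected if about 1 in 9 fell there by chance.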

The methods for determining whether targeted somatic mutagenesis has occurred can further comprise determining which mutagenic agent is a likely cause of the targeted somatic mutagenesis. The mutagenic agent can be selected from, for example, aflatoxin, 4-aminobiphenyl, aristolochic acids, arsenic compounds, asbestos, azathioprine, benzene, benzidine, beryllium and beryllium compounds, 1,3-butadiene, 1,4-butanediol dimethylsulfonate, cadmium and cadmium compounds, chlorambucil, 1-(2-chloroethyl)-3-(4-methylcyclohexyl)-1-nitrosourea (MeCCNU), bis(chloromethyl) ether and technical-grade chloromethyl methyl ether, chromium hexavalent compounds, coal tar pitches, coal tars, coke oven emissions, cyclophosphamide, cyclosporin A, diethylstilbestrol (DES), erionite, ethylene oxide, formaldehyde, melphalan, methoxsalen with ultraviolet A therapy (PUVA), mustard gas, 2-naphthylamine, neutrons, nickel compounds, radon, crystalline silica (respirable size), solar radiation, soot, strong inorganic acid mists containing sulfuric acid, tamoxifen, 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD), thiotepa, thorium dioxide, tobacco smoke, vinyl chloride, ultraviolet radiation, wood dust, X-radiation, gamma radiation, activation-induced cytidine deaminase (AID), an apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like (APOBEC) cytidine deaminase, and error-prone DNA polymerases. In some examples, the APOBEC cytidine deaminase is selected from among APOBEC1, APOBEC3A, APOBEC3B, APOBEC3C, APOBEC3D, APOBEC3F, APOBEC3G and APOBEC3H.

In particular embodiments where the mutagenic agent is selected from among AID, APOBEC1, APOBEC3G, APOBEC3H and aflatoxin, a determination that AID is a likely cause of targeted somatic mutagenesis is made if the number or percentage of observed G>A mutations in GYW motifs at the second position in codons (MC-2 sites) in the non-transcribed strand of the nucleic acid molecule is higher than expected; a determination that AID is a likely cause of targeted somatic mutagenesis is made if the number or percentage of observed C>T mutations in WRC motifs at the first position in codons (MC-1 sites) in the non-transcribed strand of the nucleic acid molecule is higher than expected; a determination that APOBEC3G is a likely cause of targeted somatic mutagenesis is made if the number or percentage of observed G>A mutations in CG motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; a determination that APOBEC3G is a likely cause of targeted somatic mutagenesis is made if the number or percentage of observed C>T mutations in CG motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; a determination that APOBEC3G is a likely cause of targeted somatic mutagenesis is made if the number or percentage of observed C>T mutations in CC motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; a determination that APOBEC3H is a likely cause of targeted somatic mutagenesis is made if the number or percentage of observed G>A mutations in GA motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; a determination that APOBEC1 is a likely cause of targeted somatic mutagenesis is made if the number or percentage of observed C>T mutations in CA motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; a determination that APOBEC1 is a likely cause of targeted somatic mutagenesis is made if the number or percentage of observed G>A mutations in TG motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; and a determination that aflatoxin is a likely cause of targeted somatic mutagenesis is made if the number or percentage of observed G>T mutations in GG motifs at MC-3 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; wherein the nucleic acid molecule is from a biological sample from a subject.

Other embodiments of the methods for determining whether targeted somatic mutagenesis has occurred further comprise determining whether an AID-associated mutation process is a likely cause of the targeted somatic mutagenesis. For example, a determination that an AID-associated mutation process is a likely cause of targeted somatic mutagenesis is made if the number or percentage of observed A>G mutations in WA motifs at MC-2 sites, G>A mutations in GYW motifs at MC-2 sites, or C>T mutations in WRC motifs at MC-1 sites, in the non-transcribed strand of the nucleic acid molecule is higher than expected.

In particular examples of the methods of the present invention, if AID is determined to be a likely cause of targeted somatic mutagenesis, the methods further comprise administering an AID inhibitor to the subject; if APOBEC3G is determined to be a likely cause of targeted somatic mutagenesis, the methods further comprise administering an APOBEC3G inhibitor to the subject; if APOBEC3H is determined to be a likely cause of targeted somatic mutagenesis, the methods further comprise administering an APOBEC3H inhibitor to the subject; or if APOBEC1 is determined to be a likely cause of targeted somatic mutagenesis, the methods further comprise administering an APOBEC1 inhibitor to the subject.

In further embodiments, the methods also comprise diagnosing cancer in the subject or determining the likelihood that the subject will develop cancer if it is determined that targeted somatic mutagenesis has occurred and/or a mutagenic agent is the likely cause of targeted somatic mutagenesis.

In other aspects, the present invention is directed to methods for determining the likelihood that a subject has or will develop cancer, comprising analyzing a nucleic acid molecule from a biological sample from the subject to detect whether targeted somatic mutagenesis by one or more mutagenic agents has occurred, and determining that the subject is likely to have or to develop cancer when targeted somatic mutagenesis has occurred.

In one example, targeted somatic mutagenesis is detected when: the number or percentage of observed G>A mutations in GYW motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed C>T mutations in WRC motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed G>A mutations in CG motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed C>T mutations in CG motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed C>T mutations in CA motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed G>A mutations in GA motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed G>A mutations in TG motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed G>T mutations in GG motifs at MC-3 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed C>T mutations in CC motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; or the number or percentage of observed A>G mutations in WA motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected.

In particular examples, the mutagenic agent is determined to be AID if the number or percentage of observed G>A mutations in GYW motifs at MC-2 sites or C>T mutations in WRC motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; APOBEC3G if the number or percentage of observed G>A mutations in CG motifs at MC-2 sites, C>T mutations in CG motifs at MC-1 sites or C>T mutations in CC motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; APOBEC1 if the number or percentage of observed C>T mutations in CA motifs at MC-1 sites or G>A mutations in TG motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; APOBEC3H if the number or percentage of observed G>A mutations in GA motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; or aflatoxin if the number or percentage of observed G>T mutations in GG motifs at MC-3 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected.
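The agent assignments listed above amount to a lookup from a (mutation type, motif, codon position) signature to a mutagenic agent. The following is a minimal sketch of that lookup. The table entries are taken from the text; the function name, the input format and the reduction of "higher than expected" to a simple comparison against a fixed fraction are assumptions made for illustration, and a real analysis would apply a statistical test.

```python
# Signatures on the non-transcribed strand, as listed in the text:
# (mutation type, motif, codon position) -> mutagenic agent.
SIGNATURES = {
    ("G>A", "GYW", "MC-2"): "AID",
    ("C>T", "WRC", "MC-1"): "AID",
    ("G>A", "CG", "MC-2"): "APOBEC3G",
    ("C>T", "CG", "MC-1"): "APOBEC3G",
    ("C>T", "CC", "MC-1"): "APOBEC3G",
    ("C>T", "CA", "MC-1"): "APOBEC1",
    ("G>A", "TG", "MC-2"): "APOBEC1",
    ("G>A", "GA", "MC-1"): "APOBEC3H",
    ("G>T", "GG", "MC-3"): "aflatoxin",
}


def likely_agents(observed_fraction: dict, expected_fraction: float = 1 / 9) -> list:
    """Return agents whose signature fraction exceeds the expected fraction.

    observed_fraction maps a signature tuple to the observed fraction of
    mutations of that type and motif that fall at that codon position
    (hypothetical input format).
    """
    agents = []
    for signature, agent in SIGNATURES.items():
        fraction = observed_fraction.get(signature, 0.0)
        if fraction > expected_fraction and agent not in agents:
            agents.append(agent)
    return agents
```

Keeping the signatures in one table means a further agent and motif can be added without changing the decision logic.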

The biological sample may comprise breast, prostate, liver, colon, stomach, pancreatic, skin, thyroid, cervical, lymphoid, haematopoietic, bladder, lung, renal, rectal, ovarian, uterine, and head or neck tissue or cells, and, in some instances, the cancer is selected from among breast, prostate, liver, colon, stomach, pancreatic, skin, thyroid, cervical, lymphoid, haematopoietic, bladder, lung, renal, rectal, ovarian, uterine, and head and neck cancer. In particular examples, the cancer is hepatocellular carcinoma, melanoma or adenoid cystic carcinoma.

In some embodiments of the present invention, if the sample comprises prostate tissue or cells, the subject is diagnosed with prostate cancer or determined to be likely to have or develop prostate cancer. In other embodiments, if the sample comprises breast tissue or cells, the subject is diagnosed with breast cancer or determined to be likely to have or develop breast cancer.

The methods of the present invention may further include administering therapy to the subject, such as, for example, radiotherapy, surgery, chemotherapy, hormone ablation therapy, pro-apoptosis therapy and/or immunotherapy. In particular examples, the methods include administering an AID inhibitor; an APOBEC3G inhibitor; an APOBEC1 inhibitor and/or an APOBEC3H inhibitor to the subject.

In another aspect, the present invention is directed to methods for identifying a nucleic acid motif targeted by a mutagenic agent, comprising analyzing the sequence of a nucleic acid molecule to identify somatic mutations of a mutation type known to be associated with the mutagenic agent; determining the codon context of the mutations to identify the preferred nucleotide position at which the mutations occur at a higher than expected frequency; and identifying the nucleotides flanking the mutations at the preferred nucleotide position so as to identify a motif that is common to the mutations.

The invention is also directed to methods for identifying a nucleic acid motif targeted by a mutagenic agent, comprising analyzing the sequence of a nucleic acid molecule to identify somatic mutations in the nucleic acid molecule; identifying a mutation type that occurs at a preferred nucleotide position within a codon at a higher than expected frequency; and identifying the nucleotides flanking the mutation type at the preferred nucleotide position so as to identify a motif that is common to the mutation type.
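The two motif-identification methods above share a pattern: tally the codon positions at which a mutation type occurs, pick the preferred position, then tally the flanking nucleotides at that position. A minimal sketch of that pattern follows. The record layout (dictionaries with `type`, `site`, `five_prime` and `three_prime` keys) and both function names are hypothetical, and taking the most frequent position as the "preferred" one is a simplification of the "higher than expected frequency" criterion.

```python
from collections import Counter


def preferred_position(mutations, mutation_type):
    """Return (site, fraction) for the codon position where the given
    mutation type occurs most often, or None if the type is absent."""
    positions = Counter(m["site"] for m in mutations if m["type"] == mutation_type)
    if not positions:
        return None
    site, count = positions.most_common(1)[0]
    return site, count / sum(positions.values())


def common_flanks(mutations, mutation_type, site):
    """Count (5' neighbour, 3' neighbour) pairs for mutations of the given
    type at the given codon position; frequent pairs suggest a motif."""
    return Counter(
        (m["five_prime"], m["three_prime"])
        for m in mutations
        if m["type"] == mutation_type and m["site"] == site
    )
```

The most frequent flanking pair returned by `common_flanks` is only a candidate motif; whether it is enriched would still need to be checked against the expected frequency.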

The mutation type can be selected from C>T, C>A, C>G, G>T, G>A, G>C, A>T, A>C, A>G, T>A, T>C and T>G mutations, and the preferred nucleotide position may be selected from among MC-1, MC-2 and MC-3.

In such methods, the expected frequency is calculated by assuming that mutations occur independently of codon context. For example, the expected frequency may be approximately 1 of every 9 mutations or 1 of every 6 mutations. In some embodiments, the mutation occurs at the preferred nucleotide position at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 80%, 85%, 90%, 95% or more of the time.

In some embodiments of the methods of the present invention, the non-transcribed strand of the nucleic acid molecule is analyzed.

The mutagenic agent may be endogenous or exogenous to the cells from which the nucleic acid was obtained. For example, the mutagenic agent may be selected from among 4-aminobiphenyl, aristolochic acids, arsenic compounds, asbestos, azathioprine, benzene, benzidine, beryllium and beryllium compounds, 1,3-butadiene, 1,4-butanediol dimethylsulfonate, cadmium and cadmium compounds, chlorambucil, 1-(2-chloroethyl)-3-(4-methylcyclohexyl)-1-nitrosourea (MeCCNU), bis(chloromethyl) ether and technical-grade chloromethyl methyl ether, chromium hexavalent compounds, coal tar pitches, coal tars, coke oven emissions, cyclophosphamide, cyclosporin A, diethylstilbestrol (DES), erionite, ethylene oxide, formaldehyde, melphalan, methoxsalen with ultraviolet A therapy (PUVA), mustard gas, 2-naphthylamine, neutrons, nickel compounds, radon, crystalline silica (respirable size), solar radiation, soot, strong inorganic acid mists containing sulfuric acid, tamoxifen, 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD), thiotepa, thorium dioxide, tobacco smoke, vinyl chloride, ultraviolet radiation, wood dust, X-radiation, gamma radiation, an APOBEC cytidine deaminase, and an error-prone DNA polymerase. In particular examples of the methods of the invention, the nucleic acid molecule or the cell from which the nucleic acid molecule was obtained is known to have been exposed to the mutagenic agent prior to analysis.

Embodiments of the methods of the invention may also comprise first isolating the nucleic acid molecule and/or sequencing all or a part of the nucleic acid molecule. The nucleic acid molecule can comprise all or part of a single gene or the cDNA of a single gene; or all or part of two or more genes or the cDNA of two or more genes. In some instances, the gene is a gene associated with cancer. For example, the gene may be selected from among TP53, PIK3CA, ERBB2, DIRAS3, TET2 and nitric oxide synthase (NOS) genes. In further embodiments, nucleic acid molecules that constitute the whole exome of a cell or the whole genome of a cell are analyzed.

The invention is also directed to a kit comprising a reagent for use in a method described herein. The reagent may be selected from among, for example, a primer, dNTPs and polymerase.

In particular embodiments of the methods of the present invention, all or a part of the method is performed by a processing system.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic showing targeted somatic mutation in a region of interest on the non-transcribed strand of a nucleic acid molecule.

FIG. 2 is a schematic showing an exemplary process of analysis of nucleic acid molecules to determine whether targeted somatic mutagenesis by AID or APOBEC3G has occurred.

FIG. 3 is a schematic showing an analysis performed to determine whether mutations occur at random or as a result of targeted somatic mutagenesis.

FIG. 4 shows the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G), TG/CA sites (APOBEC1) and WA sites in the TP53 gene of nucleic acid obtained from subjects with cervical cancer, and statistical analysis of the occurrence of the mutations.

FIG. 5 shows the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G), TG/CA sites (APOBEC1) and WA sites in the TP53 gene of nucleic acid obtained from subjects with colon adenocarcinoma, and statistical analysis of the occurrence of the mutations.

FIG. 6 shows the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G), TG/CA sites (APOBEC1), WA sites, and GG sites (aflatoxin) in the TP53 gene of nucleic acid obtained from subjects with hepatocellular carcinoma, and statistical analysis of the occurrence of the mutations.

FIG. 7 shows the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G), TG/CA sites (APOBEC1) and WA sites in the TP53 gene of nucleic acid obtained from subjects with pancreatic cancer, and statistical analysis of the occurrence of the mutations.

FIG. 8 shows the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G), TG/CA sites (APOBEC1) and WA sites in the TP53 gene of nucleic acid obtained from subjects with prostate cancer, and statistical analysis of the occurrence of the mutations.

FIG. 9 shows the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G) and WA sites in the TP53 gene of nucleic acid obtained from subjects with malignant melanoma, and statistical analysis of the occurrence of the mutations.

FIG. 10 shows the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G), TG/CA sites (APOBEC1) and WA sites in the TP53 gene of nucleic acid obtained from subjects with cervical adenocarcinoma, and statistical analysis of the occurrence of the mutations.

FIG. 11 shows the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G) and WA sites in the NOS gene of nucleic acid obtained from subjects with cervical adenocarcinoma, and statistical analysis of the occurrence of the mutations.

FIG. 12 shows the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G) and WA sites in the PIK3CA gene of nucleic acid obtained from subjects with breast cancer, and statistical analysis of the occurrence of the mutations.

FIG. 13 shows the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G), TG/CA sites (APOBEC1) and WA sites in the TET2 gene of nucleic acid obtained from subjects with haematopoietic and lymphoid cancer, and statistical analysis of the occurrence of the mutations.

FIG. 14 shows the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G) and WA sites in the whole exome of tissue obtained from two subjects with adenoid cystic carcinoma, and statistical analysis of the occurrence of the mutations. (A) Subject PD3185a. (B) Subject PD3181a.

FIG. 15A and FIG. 15B show the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G) and WA sites in the whole exome of tissue obtained from four subjects with prostate carcinoma, and statistical analysis of the occurrence of the mutations. (A) Subject WA7. (B) Subject WA26. (C) Subject PR-09-3421. (D) Subject PR-2762.

FIG. 16 shows the frequency and location within codons of mutations at GA sites (APOBEC3H) in the whole exome of nucleic acid obtained from a subject with bladder cancer, and statistical analysis of the occurrence of the mutations.

FIG. 17 shows the frequency and location within codons of mutations at CC sites (APOBEC3G) in the whole exome of nucleic acid obtained from 8 subjects with bladder cancer (A), and a single subject with bladder cancer (B), and statistical analysis of the occurrence of the mutations.

FIG. 18 is a schematic of the process of detecting targeted somatic mutagenesis in a nucleic acid molecule using a processing system.

TABLE A
NUCLEOTIDE SYMBOLS

SYMBOL    DESCRIPTION
A         Adenosine
C         Cytidine
G         Guanosine
T         Thymidine
U         Uridine
M         Amino (adenosine, cytosine)
K         Keto (guanosine, thymidine)
R         Purine (adenosine, guanosine)
Y         Pyrimidine (cytosine, thymidine)
W         Adenosine or thymidine
N         Any nucleotide
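As an illustration of how the degenerate symbols in Table A can be used to locate motifs such as GYW or WRC in a sequence, the following sketch expands each symbol into a regular-expression character class. The function name is hypothetical. Two assumptions are made: W is taken as A or T, following the standard IUPAC convention, and N is expanded to the four DNA bases only.

```python
import re

# One-letter expansions of the symbols in Table A (DNA bases only for N).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "N": "ACGT",
}


def find_motif(sequence: str, motif: str) -> list:
    """Return the 0-based start offsets of every occurrence of a degenerate
    motif in the sequence, including overlapping occurrences."""
    body = "".join(f"[{IUPAC[symbol]}]" for symbol in motif)
    # A zero-width lookahead lets the regex engine report overlapping matches.
    pattern = re.compile(f"(?=({body}))")
    return [match.start() for match in pattern.finditer(sequence)]
```

For instance, `find_motif("AGCTGTA", "GYW")` reports the offsets at which a G is followed by a pyrimidine and then by A or T.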

DETAILED DESCRIPTION

1. Definitions

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the art to which the invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, preferred methods and materials are described. For the purposes of the present invention, the following terms are defined below.

The articles “a” and “an” are used herein to refer to one or to more than one (i.e. to at least one) of the grammatical object of the article. By way of example, “an element” means one element or more than one element.

The term “biological sample” as used herein refers to a sample that may be extracted, untreated, treated, diluted or concentrated from a subject or patient. Suitably, the biological sample is selected from any part of a patient's body, including, but not limited to hair, skin, nails, tissues or bodily fluids such as saliva and blood.

As used herein, the term “codon context” with reference to a mutation refers to the nucleotide position within a codon at which the mutation occurs. For the purposes of the present invention, the nucleotide positions within a mutated codon (MC; i.e. a codon containing the mutation) are annotated MC-1, MC-2 and MC-3, and refer to the first, second and third nucleotide positions, respectively, when the sequence of the codon is read 5′ to 3′. Accordingly, the phrase “determining the codon context of a mutation” or similar phrase means determining at which nucleotide position within the mutated codon the mutation occurs, i.e. MC-1, MC-2 or MC-3.
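The MC-1, MC-2 and MC-3 annotation can be computed directly from the position of a mutation in an in-frame coding sequence. The sketch below assumes a coding sequence read 5′ to 3′ that starts at the first nucleotide of a codon, and a 0-based offset for the mutated nucleotide. The function names and the offset convention are illustrative assumptions.

```python
def codon_context(cds_offset: int) -> str:
    """Return "MC-1", "MC-2" or "MC-3" for a 0-based offset into an
    in-frame coding sequence read 5' to 3'."""
    if cds_offset < 0:
        raise ValueError("offset must be non-negative")
    return f"MC-{cds_offset % 3 + 1}"


def mutated_codon(cds: str, cds_offset: int) -> str:
    """Return the three-nucleotide codon that contains the given offset."""
    start = cds_offset - cds_offset % 3
    return cds[start:start + 3]
```

With the coding sequence `ATGGCCTAA`, offset 4 falls in the second codon (`GCC`) at its second position, so `codon_context(4)` gives `MC-2`.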

Throughout this specification, unless the context requires otherwise, the words “comprise,” “comprises” and “comprising” will be understood to imply the inclusion of a stated step or element or group of steps or elements but not the exclusion of any other step or element or group of steps or elements.

By “gene” is meant a unit of inheritance that occupies a specific locus on a genome and comprises transcriptional and/or translational regulatory sequences and/or a coding region and/or non-translated sequences (i.e., introns, 5′ and 3′ untranslated sequences).

As used herein, the term “likelihood” is used as a measure of whether targeted somatic mutagenesis has occurred, whether a particular mutagenic agent is a cause of targeted somatic mutagenesis, and whether a subject with nucleic acid containing targeted somatic mutations has or will develop cancer, based on a given mathematical model. An increased likelihood, for example, may be relative or absolute and may be expressed qualitatively or quantitatively. For instance, an increased likelihood or risk that a subject will develop cancer may be expressed as simply determining the number of targeted somatic mutations (as taught herein) and placing the test subject in an “increased likelihood or risk” category, based upon previous population studies.

In some embodiments, the methods comprise comparing the number or percentage of targeted somatic mutations to a preselected or threshold number or percentage. Thresholds may be selected that provide an acceptable ability to predict diagnosis, likelihood or prognostic risk. In illustrative examples, receiver operating characteristic (ROC) curves are calculated by plotting the value of a variable versus its relative frequency in two populations in which a first population has a first condition or risk and a second population has a second condition or risk (called arbitrarily, for example, “healthy condition” and “cancer”, or “low risk” and “high risk”).

A distribution of number of mutations for subjects with and without a disease will likely overlap. Under such conditions, a test does not absolutely distinguish a first condition and a second condition with 100% accuracy, and the area of overlap indicates where the test cannot distinguish the first condition and the second condition. A threshold is selected, above which the test is considered to be “positive” and below which the test is considered to be “negative.” The area under the ROC curve (AUC) provides the C-statistic, which is a measure of the probability that the perceived measurement will allow correct identification of a condition (see, e.g., Hanley et al., Radiology 143: 29-36 (1982)). The term “area under the curve” or “AUC” refers to the area under the curve of a receiver operating characteristic (ROC) curve, both of which are well known in the art. AUC measures are useful for comparing the accuracy of a classifier across the complete data range. Classifiers with a greater AUC have a greater capacity to classify unknowns correctly between two groups of interest (e.g., a healthy condition mutation status and a cancer mutation status). ROC curves are useful for plotting the performance of a particular feature in distinguishing or discriminating between two populations (e.g., cases having a cancer and controls without the cancer). Typically, the feature data across the entire population (e.g., the cases and controls) are sorted in ascending order based on the value of a single feature. Then, for each value for that feature, the true positive and false positive rates for the data are calculated. The sensitivity is determined by counting the number of cases above the value for that feature and then dividing by the total number of cases. The specificity is determined by counting the number of controls below the value for that feature and then dividing by the total number of controls. Although this definition refers to scenarios in which a feature is elevated in cases compared to controls, this definition also applies to scenarios in which a feature is lower in cases compared to the controls (in such a scenario, samples below the value for that feature would be counted). ROC curves can be generated for a single feature as well as for other single outputs; for example, two or more features can be mathematically combined (e.g., added, subtracted, multiplied, etc.) to produce a single value, and this single value can be plotted in a ROC curve. Additionally, any combination of multiple features (e.g., one or more other epigenetic markers), in which the combination derives a single output value, can be plotted in a ROC curve. These combinations of features may comprise a test. The ROC curve is the plot of the sensitivity of a test against the specificity of the test, where sensitivity is traditionally presented on the vertical axis and specificity (conventionally as 1 − specificity, i.e. the false positive rate) is traditionally presented on the horizontal axis. Thus, “AUC ROC values” are equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. An AUC ROC value may be thought of as equivalent to the Mann-Whitney U test, which tests for the median difference between scores obtained in the two groups considered if the groups are of continuous data, or to the Wilcoxon test of ranks.
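The sensitivity, specificity and AUC calculations described above can be sketched as follows. This is a minimal illustration only: the function names and the example mutation counts are hypothetical, and the handling of ties (a value equal to the threshold is counted as “not above”, and tied case/control pairs count one half in the AUC) is an assumption that the text does not specify.

```python
def roc_points(cases, controls):
    """Return (threshold, sensitivity, specificity) for each observed value."""
    points = []
    for t in sorted(set(cases) | set(controls)):
        # Sensitivity: fraction of cases strictly above the threshold.
        sensitivity = sum(1 for x in cases if x > t) / len(cases)
        # Specificity: fraction of controls at or below the threshold
        # (assumption: ties with the threshold count as "not above").
        specificity = sum(1 for x in controls if x <= t) / len(controls)
        points.append((t, sensitivity, specificity))
    return points


def auc_by_ranking(cases, controls):
    """AUC as the probability that a random case outranks a random control."""
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # assumption: ties count one half
    return wins / (len(cases) * len(controls))


# Hypothetical counts of targeted somatic mutations per subject.
cancer_counts = [5, 7, 9]
healthy_counts = [1, 3, 5]
print(roc_points(cancer_counts, healthy_counts))
print(auc_by_ranking(cancer_counts, healthy_counts))
```

The pairwise ranking form of the AUC is used here because it corresponds directly to the Mann-Whitney interpretation given in the text; integrating the ROC curve numerically would be an equally reasonable choice.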

Alternatively, or in addition, thresholds may be established by obtaining an earlier mutation status result from the same patient, to which later results may be compared. In these embodiments, the individual in effect acts as their own “control group.” In another embodiment, thresholds may be established by analyzing the number of targeted somatic mutations in nucleic acid from non-diseased or healthy tissue from a patient and comparing it to the number of targeted somatic mutations in nucleic acid from diseased or cancerous tissue from the same patient.

The term “mutagenic agent” refers to an agent that can cause mutagenesis of DNA. Mutagenic agents include endogenous agents (i.e. agents that are endogenous to, or are produced by, the cell in which the DNA is contained) and exogenous agents (i.e. agents that are exogenous to, or not produced by, the cell in which the DNA is contained), and include for example chemicals, proteins, enzymes, radiation and viruses.

As used herein, a “mutation type” refers to the specific nucleotide substitution that comprises the mutation, and is selected from among C>T, C>A, C>G, G>T, G>A, G>C, A>T, A>C, A>G, T>A, T>C and T>G mutations. Thus, for example, a mutation type of C>T refers to a mutation in which the targeted or mutated nucleotide C is replaced with the substituting nucleotide T.

The term “nucleic acid” as used herein designates DNA, cDNA, mRNA, RNA, rRNA or cRNA. The term typically refers to polynucleotides greater than 30 nucleotide residues in length.

The terms “patient” and “subject” are used interchangeably and refer to a human or other mammal, and include any individual it is desired to examine or treat using the methods of the invention. However, it will be understood that “patient” does not imply that symptoms are present. Suitable mammals that fall within the scope of the invention include, but are not restricted to, humans and other primates, livestock animals (e.g., sheep, cows, horses, donkeys, pigs), laboratory test animals (e.g., rabbits, mice, rats, guinea pigs, hamsters), companion animals (e.g., cats, dogs) and captive wild animals (e.g., foxes, deer, dingoes).

The term “somatic mutation” refers to a mutation in the DNA of somatic cells (i.e. not germ cells), occurring after conception. “Somatic mutagenesis” therefore refers to the process by which somatic mutations occur.

As used herein, “targeted somatic mutagenesis” refers to the process of somatic mutagenesis resulting from one or more mutagenic agents, wherein mutagenesis occurs at a targeted nucleotide within a motif, the targeted nucleotide is present at a particular position within a codon (e.g. the first, second or third position of the mutated codon reading from 5′ to 3′, annotated MC-1, MC-2 and MC-3, respectively), and the targeted nucleotide is mutated to a particular substituting nucleotide (i.e. the mutation is of a particular mutation type, e.g. C>T, not C>A or C>G). Thus, a determination that targeted somatic mutagenesis is occurring requires analysis of the type of mutation (e.g. C>T), the motif at which the mutation occurs (e.g. WRC) and the codon context of the mutation, i.e. the position within the codon at which the mutation occurs (e.g. MC-1, MC-2 or MC-3). “Targeted somatic mutation” therefore refers to a mutation resulting from targeted somatic mutagenesis.

As used herein, the terms “treatment,” “treating,” and the like, refer to obtaining a desired pharmacologic and/or physiologic effect. The effect may be prophylactic in terms of completely or partially preventing a condition (such as cancer) or symptom thereof and/or may be therapeutic in terms of a partial or complete cure for a condition and/or adverse effect attributable to the condition. “Treatment,” as used herein, covers any treatment of a condition in a mammal, particularly in a human, and includes: (a) preventing the condition from occurring in a subject which may be at risk of developing the condition but has not yet been diagnosed as having it; (b) inhibiting the condition, i.e., arresting its development; and (c) relieving the condition, i.e., causing regression of the condition.

As used herein, “whole exome” refers to all of the exons in the genome. Thus, analysis of the sequence of a whole exome from a cell refers to analysis of the sequence of all of the exons in the genome from the cell.

2. Mutagenic Agents Involved in Somatic Mutagenesis

Both exogenous and endogenous factors can act as mutagenic agents that cause or play a role in somatic mutagenesis. Exogenous factors include, but are not limited to, aflatoxins, 4-aminobiphenyl, aristolochic acids, arsenic compounds, asbestos, azathioprine, benzene, benzidine, beryllium and beryllium compounds, 1,3-butadiene, 1,4-butanediol dimethylsulfonate (busulfan, Myleran®), cadmium and cadmium compounds, chlorambucil, 1-(2-chloroethyl)-3-(4-methylcyclohexyl)-1-nitrosourea (MeCCNU), bis(chloromethyl) ether and technical-grade chloromethyl methyl ether, chromium hexavalent compounds, coal tar pitches, coal tars, coke oven emissions, cyclophosphamide, cyclosporin A, diethylstilbestrol (DES), erionite, ethylene oxide, formaldehyde, melphalan, methoxsalen with ultraviolet A therapy (PUVA), mustard gas, 2-naphthylamine, neutrons, nickel compounds, radon, crystalline silica (respirable size), solar radiation, soots, strong inorganic acid mists containing sulfuric acid, tamoxifen, 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD), thiotepa, thorium dioxide, tobacco smoke, vinyl chloride, ultraviolet radiation, wood dust, X-radiation and gamma radiation. Endogenous factors include, but are not limited to, activation-induced cytidine deaminase (AID), apolipoprotein B mRNA-editing enzyme, catalytic polypeptide-like (APOBEC) cytidine deaminases, and error-prone DNA polymerases, such as DNA polymerase-eta.

2.1 AID

Activation-induced cytidine deaminase (AID) is an important enzyme in adaptive immunity, involved in somatic hypermutation (SHM) and class switch recombination of immunoglobulin genes in B cells. AID triggers SHM by deaminating cytidines to uracils (C>U) to diversify the immunoglobulin variable region genes (VDJ) and create new antigen-binding sites.

It has been well recognized that the mutation patterns resulting from these SHM processes that give rise to new antigen-binding sites are not random. Clustering and hotspots of mutational activity influenced by neighboring base sequences have been identified, and the catalytic properties and specific mutation spectra resulting from AID and its involvement in AID-mediated DNA deamination of rearranged immunoglobulin variable region genes are well documented.

The SHM process is currently considered to occur in two phases. In phase 1, the gene encoding the AID protein is upregulated in germinal center B lymphocytes (Muramatsu et al. (2000) Cell 102: 553-563). AID then targets mutations to G:C base pairs at the reverse complement hotspots GYW/WRC (where Y=C/T, W=A/T, R=A/G; and the underlined nucleotides, i.e. the G of GYW and the C of WRC, constitute the targeted base pair) by the direct deamination of cytidine to uracil (C>U) in the transcribed single stranded (ss) regions of the DNA exposed during transcription (Di Noia and Neuberger (2007) Annu Rev Biochem. 76: 1-22; Teng and Papavasiliou (2007) Annu Rev Genet 41:107-120). AID occupies the target cytidine before deamination (Bhutani et al. (2011) Cell 146: 866-872). The uracils in DNA are very mutagenic if left unrepaired and they activate a DNA base excision repair (BER) process involving uracil DNA glycosylase (UNG) causing apurinic/apyrimidinic (AP), or ‘abasic’, sites, leading to ssDNA nicks (via an apurinic/apyrimidinic endonuclease activity, APE) and attracting further DNA patch repair activity (Peled et al. (2008) Ann Rev Immunol 26: 481-511). Once UNG triggers the BER pathways to remove the uracils, the abasic site created can, at replication and repair, be replaced by any of the bases A, G, C or T.
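The statement that GYW and WRC are reverse complement hotspots can be checked mechanically. The short sketch below uses the degenerate codes defined above (Y=C/T, W=A/T, R=A/G); the table and function names are illustrative and are not part of the described method.

```python
# Complement of each nucleotide code used in this section. W (A/T) is its own
# complement; R (A/G) and Y (C/T) are complements of each other.
IUPAC_COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "W": "W", "R": "Y", "Y": "R",
}


def reverse_complement(motif):
    """Return the motif as it reads 5' to 3' on the opposite strand."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(motif))


for motif in ("WRC", "WA", "CG", "CA"):
    print(motif, "<->", reverse_complement(motif))
```

The same relationship accounts for the other motif pairs written with a slash in this document, such as WA/TW and TG/CA.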

The main strand bias mutation pattern associated with phase 1 is characterized by dominant C>T and G>A transitions, with the total number of mutations of G exceeding the number of mutations of C (Steele (2009) Mol Immunol 46: 305-320). It has been deduced that the resulting strand-biased mutation pattern is consistent with the known mis-incorporation signature of mammalian RNA polymerase II copying the template DNA strand carrying AID lesions, uracils and AP sites (see e.g. Steele (2009) Mol Immunol 46: 305-320).

In phase 2, the mutations are targeted to A:T base pairs predominantly at WA-hotspot motifs and are distinctly strand-biased with mutations of A exceeding mutations of T by 2-3 fold (see e.g. Steele (2009) Mol Immunol 46: 305-320). In phase 2, G:U mispairs recruit the binding of the mismatch DNA repair heterodimer MSH2-MSH6 complex, which in turn recruits the error-prone Y family translesion protein DNA polymerase-eta targeting mutations in a short patch error-prone DNA repair process to both WA-sites and some other sequence stretches in the VDJ target sequence region.

Several studies suggest the possibility that aberrant AID-initiated SHM processes might result in the conversion of C>U in DNA outside of the germinal center environment, and thus contribute to oncogenesis in other genes (Beale et al. (2004) J Mol. Biol 337: 585-594; Marusawa H. (2008) Int J Biochem Cell Biol 40: 1399-1402). SHM-like activity has been found to occur in a range of genes such as BCL-6 in human tonsillar B cells (Yavuz et al. (2002) Mol Immunol 39: 485-493), the CD5/4, PIM 1 and CMYC genes in T-lymphomas (Kotani et al. (2005) PNAS 102: 4506-4511), and BCL-6 and C-MYC in B-lymphomas (Nilsen et al. (2005) Oncogene 24: 3063-3066). AID-initiated SHM activity has also been investigated as a potential source of TP53 mutations in a number of studies. In one such study, mutation targeting in TP53 in B-cell chronic lymphocytic leukaemia (B-CLL) was found to exhibit the characteristic traits of the SHM process (Malcikova et al. (2008) Molecular Immunology 45: 1525-9). Although the number of mutations was low for the two patients observed, the data reveal a significant bias to point mutations at C:G pairs, and a significant preference for the RGYW/WRCY motifs (28% and 44% in the first and second patients, respectively). In the second patient, it was found that 6/8 point mutations affecting A:T pairs were localized at WA/TW motifs, which are a hallmark characteristic of the SHM single point mutation spectrum. A high expression of AID transcript was found in the first patient, but not in the second who was already IgVH-mutated. As shown herein and described in Lindley and Steele (ISRN Genomics (2013) 921418) and Lindley (Cancer Genet. (2013) 206(6):222-6), strand-biased SHM-like mutation processes appear closely associated with cancer.

There are also examples of infectious agents that actively induce AID expression and result in a TP53 mutation pattern that is consistent with the known characteristics of SHM activity in Ig genes. Examples include hepatitis C virus (Machida et al. (2004) Proc Natl Acad Sci U.S.A. 101: 4262-4267), Epstein Barr virus (Epeldegui et al. (2007) Mol. Immunol 44: 934-942) and Helicobacter pylori (Matsumoto et al. (2007) Nat Med. 13: 470-476). AID has been linked to B cell tumorigenesis and other cancers (Honjo et al. (2012) Adv Cancer Res. 113:1-44), and transgenic expression of AID causes tumor formation in mice (Okazaki et al. (2003) J Exp Med 197: 1173-1181).

2.2 APOBEC Cytidine Deaminases

In addition to AID, the human genome encodes several homologous APOBEC cytidine deaminases that are known to be involved in innate immunity and RNA editing (Smith et al. (2012) Semin. Cell. Dev. Biol. 23:258-268). In humans, at least APOBEC1, APOBEC3A, APOBEC3B, APOBEC3C, APOBEC3D, APOBEC3F, APOBEC3G and APOBEC3H are involved in providing innate immunity and/or cellular mRNA editing.

For example, APOBEC1 is responsible for ApoB pre-mRNA editing, where it causes deamination of cytidine 6666 to change a glutamine codon into a stop codon, thus generating a shorter form of ApoB (ApoB48). APOBEC1 can also deaminate cytidine in DNA (Harris et al. (2002) Mol Cell. 10:1247-1253; Petersen-Mahrt and Neuberger (2003) J Biol Chem. 278:19583-19586). The APOBEC3 enzymes deaminate mobile genetic elements (i.e. endogenous retroelements and exogenous viruses), mutating the DNA as a form of innate immunity. For example, APOBEC3G acts on HIV and other retroviruses (e.g. simian immunodeficiency virus (SIV), equine infectious anemia virus (EIAV), murine leukemia virus (MLV), and foamy virus (FV)) to mutate the minus-strand DNA during reverse transcription. Other APOBEC3 enzymes have also been shown to act on HIV and other retroviruses, as well as hepatitis B virus, parvovirus and AAV-2 (reviewed in Smith et al. (2012) Sem Cell Dev Biol 23:258-268).

Like AID, the APOBEC cytidine deaminases have been implicated in oncogenesis. For example, transgenic expression of APOBEC1 causes tumor formation in mice (Yamanaka et al. (1995) PNAS 92:8483-8487); high expression of APOBEC3B leads to somatic mutation in tumor-associated genes (Shinohara et al. (2012) Scientific Reports 2: 806); APOBEC3B is upregulated in at least breast, bladder, cervix (adenocarcinoma and squamous cell carcinoma), and head and neck cancer, with an associated increase in mutations at APOBEC3B motifs (Burns et al. (2013) Nature 494: 366-370; Burns et al. (2013) Nature Genetics 45:977-983); and APOBEC enzyme mutation signatures have been shown to be widespread in a variety of cancers.

A study comparing targeting preferences for AID, APOBEC1 and APOBEC3G using a bacterial mutation assay demonstrated the critical importance of nucleotides immediately 5′ and 3′ of the targeted C for the specificities of the cytidine deaminases (Beale et al. (2004) J Mol. Biol 337: 585-594). While APOBEC3G can only deaminate cytidines on ssDNA, APOBEC1 can edit cytidines on DNA or dsRNAs. It was observed that 79% of transitions in the presence of APOBEC1 were associated with a 5′ T, thus implying a motif of TG/CA for APOBEC1. The APOBEC3G motif is suggested as being CG/CG and/or CC (Beale et al. (2004) J Mol. Biol 337: 585-594; and Rathore et al. (2013) J. Mol. Biology 425(22):4442-54). Other studies indicate that other APOBEC enzymes, such as APOBEC3A, APOBEC3B and APOBEC3F, have a TC motif, or a more stringent TCW motif (where W corresponds to either A or T) (Bishop et al. (2004) Curr Biol. 14:1392-1396; Thielen et al. (2010) J Biol Chem 285:27753-27766; Henry et al. (2009) PLoS One. 4:e4277; Shinohara et al. (2012) Scientific Reports 2: 806; Burns et al. (2013) Nature Genetics 45:977-983). APOBEC3H has been suggested to target a GA/TC motif.

3. Methods for Detecting Targeted Somatic Mutagenesis

As demonstrated herein, some mutagenic agents not only cause mutagenesis of a nucleotide at one or more particular motifs, but the motif and mutated nucleotide are recognized within the codon context, i.e. the mutated nucleotide is at a particular position within the codon structure, such as the first, second or third nucleotide in the mutated codon (read 5′ to 3′). There is also a clear preference for the replacement or substituting nucleotide. This combination of motif-specific and codon context-specific targeting by mutagenic agents is termed herein targeted somatic mutagenesis. By way of a non-limiting example, and as shown in FIG. 1, mutation of A at a WA motif in the non-transcribed strand of a nucleic acid molecule may preferentially occur at the first position of the mutated codon (MC-1) and be a mutation to C (i.e. A>C). Thus, the likelihood of whether or not targeted somatic mutation of a nucleic acid molecule has occurred can be determined by analyzing the sequence of a nucleic acid molecule to determine the codon context of mutations of a mutation type (e.g. A>C) at one or more particular motifs (e.g. a WA motif). If there is no codon bias in the location of the mutations of the mutation type at the motif (i.e. the mutations are essentially evenly distributed across each position in the codons), then it is most likely that the mutations arose by chance and not as a result of targeted somatic mutagenesis by a mutagenic agent. However, if there is a higher than expected percentage or number of mutations of the mutation type at one particular position in codons (e.g. MC-1, MC-2 or MC-3 sites) in the nucleic acid molecule, then this indicates that targeted somatic mutagenesis has occurred or is likely to have occurred.

The “expected number or percentage” of the mutations described above is the number or percentage of mutations expected if the mutations are independent of other mutations and codon context, i.e. the distribution of mutations at each targeted nucleotide in each position in the codon is essentially even. Thus, for example, when assessing mutations arising across MC-1, MC-2 and MC-3 positions or sites, it would be expected that mutation of a nucleotide (e.g. A) to any one of the other three nucleotides (e.g. G, C or T) at any one of the three sites (e.g. MC-1, MC-2 or MC-3) would occur as 1 in every 9 mutations (i.e. 1 in 3 chance of A to any one of G, C or T, and a 1 in 3 chance at any site, equaling a 1 in 9 chance overall) or approximately 11% of the time. When assessing mutations arising across just two of the nucleotide positions in the mutated codon, such as the MC-1 and MC-2 sites, it would be expected that mutation of a nucleotide (e.g. A) to any one of the other nucleotides (e.g. G, C or T) at either of the two sites (e.g. MC-1 or MC-2) would occur as 1 in every 6 mutations, or approximately 17% of the time (i.e. 1 in 3 chance of A to any one of G, C or T, and a 1 in 2 chance at any site, equaling a 1 in 6 chance overall). Similarly, when assessing mutations arising across just one of the sites (e.g. MC-1), it would be expected that mutation of a nucleotide (e.g. A) to any one of the other nucleotides (e.g. G, C or T) would occur as 1 in every 3 mutations, or approximately 33% of the time.
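The arithmetic in this paragraph reduces to one over the product of the number of codon sites assessed and the three possible substituting nucleotides. A minimal sketch follows; the function name is illustrative, and the even-distribution assumption is the one stated in the text.

```python
from fractions import Fraction


def expected_fraction(sites_assessed, substitutions=3):
    """Expected share of one (mutation type, codon site) combination when
    mutations are evenly distributed over the sites assessed and over the
    three possible substituting nucleotides."""
    return Fraction(1, sites_assessed * substitutions)


for sites in (3, 2, 1):
    share = expected_fraction(sites)
    print(f"{sites} site(s): {share} (about {float(share):.0%})")
```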

This is illustrated in FIG. 2, where the prevalence of C>T mutations at MC-1 sites (i.e. at the first nucleotide position within the mutated codon) is assessed to determine whether targeted somatic mutation has occurred or whether the observed mutations arise randomly. If mutation of cytosines across MC-1 and MC-2 sites at a WRC motif is random, then it would be expected that the type and position of the mutations is evenly distributed, and that a C>T mutation at MC-2 occurs once in every six times (or approximately 17%), with the other 5 mutations being C>A at MC-1, C>A at MC-2, C>G at MC-1, C>G at MC-2, and C>T at MC-1. In the particular example shown in FIG. 2, there are a total of 82 mutations of a cytosine at MC-1 or MC-2 sites at a WRC motif. If the mutagenesis was random, it would be expected that one sixth (or 17%) of these would be C>T mutations at MC-2 sites, equivalent to about 14 occurrences. However, in this example, there are 72 observed C>T mutations at MC-2 sites, indicating that targeted somatic mutagenesis of the nucleic acid has occurred.
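The figures in this example (82 mutations in total, an expected share of one sixth, 72 observed) can be reproduced as follows. The one-sided binomial tail probability is not part of the text; it is added here only as one possible way to quantify “higher than expected”, under the assumption that each mutation falls independently into the category of interest with probability 1/6.

```python
from math import comb


def binomial_upper_tail(n, p, k_observed):
    """P(X >= k_observed) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_observed, n + 1)
    )


total_mutations = 82       # cytosine mutations at the two assessed sites
expected_share = 1 / 6     # one mutation type at one of two sites
observed = 72              # mutations of the type and site of interest

expected_count = total_mutations * expected_share
tail_probability = binomial_upper_tail(total_mutations, expected_share, observed)

print(f"expected about {expected_count:.1f}, observed {observed}")
print(f"P(at least {observed} by chance) = {tail_probability:.3g}")
```

Other tests (for example a chi-squared test across all six categories) could serve the same purpose; the binomial form is used only because it maps directly onto the single category discussed above.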

Typically, when targeted somatic mutagenesis occurs as a result of the activity of one or more mutagenic agents and an assessment is made across the three sites of the codon (e.g. MC-1, MC-2 and MC-3), the particular mutations that are associated with the mutagenic agent are observed at least or about 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95% or more of the time. When an assessment is made across two sites (e.g. MC-1 and MC-2; MC-1 and MC-3; or MC-2 and MC-3), the particular mutations that are associated with the mutagenic agent are typically observed at least or about 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95% or more of the time. When an assessment is made across only one site (e.g. MC-1; MC-2; or MC-3), the particular mutations that are associated with the mutagenic agent are typically observed at least or about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95% or more of the time.

By assessing the type of mutation at a particular motif (e.g. C>T mutations at a WRC motif) as well as the codon context of the mutation (e.g. whether the mutation is at an MC-2 site), a more accurate assessment of the activity of the mutagenic agent can be made compared to when only the mutation at the motif is assessed and codon context is not factored in. Accordingly, using the methods described herein, the likelihood that a particular mutagenic agent or a mutagenic process, such as an AID-associated mutagenic process, is a cause of targeted somatic mutagenesis of a nucleic acid molecule can be assessed by analyzing the sequence of the nucleic acid molecule to determine the codon context of mutations at motif(s) targeted by the mutagenic agent or mutagenic process.

3.1 Targeted Somatic Mutagenesis by AID, APOBEC1, APOBEC3G, APOBEC3H and Aflatoxin

As described above, AID is known to target the motif GYW/WRC, wherein the underlined nucleotide is mutated. As demonstrated herein, there is a significant preference for targeting of the G to occur at MC-2 sites, resulting in G>A mutations. Accordingly, a higher than expected number or percentage of G>A mutations at GYW motifs at MC-2 sites in the non-transcribed strand of a nucleic acid molecule indicates that AID is a likely cause of targeted somatic mutagenesis of the nucleic acid, and that AID is active in the cells and/or tissue from which the nucleic acid was obtained. As also demonstrated herein, there is a significant preference for targeting of the C to occur at MC-1 sites, resulting in C>T mutations. Accordingly, a higher than expected number or percentage of C>T mutations at WRC motifs at MC-1 sites in the non-transcribed strand of a nucleic acid molecule indicates that AID is a likely cause of targeted somatic mutagenesis of the nucleic acid, and that AID is active in the cells and/or tissue from which the nucleic acid was obtained.

APOBEC3G is known to target CG/CG motifs. The studies described herein demonstrate that there is a significant preference for targeting of the G to occur at MC-2 sites, resulting in G>A mutations. Accordingly, a higher than expected number or percentage of G>A mutations at CG motifs at MC-2 sites in the non-transcribed strand of a nucleic acid molecule indicates that APOBEC3G is a likely cause of targeted somatic mutagenesis of the nucleic acid, and that APOBEC3G is active in the cells and/or tissue from which the nucleic acid was obtained. There is also a significant preference for targeting of the C to occur at MC-1 sites, resulting in C>T mutations. Accordingly, a higher than expected number or percentage of C>T mutations at CG motifs at MC-1 sites in the non-transcribed strand of a nucleic acid molecule indicates that APOBEC3G is a likely cause of targeted somatic mutagenesis of the nucleic acid, and that APOBEC3G is active in the cells and/or tissue from which the nucleic acid was obtained.

APOBEC3G is also known to target CC motifs. The studies described herein demonstrate that there is a significant preference for targeting of the C to occur at MC-1 sites, resulting in C>T mutations. Accordingly, a higher than expected number or percentage of C>T mutations at CC motifs at MC-1 sites in the non-transcribed strand of a nucleic acid molecule indicates that APOBEC3G is a likely cause of targeted somatic mutagenesis of the nucleic acid, and that APOBEC3G is active in the cells and/or tissue from which the nucleic acid was obtained.

APOBEC1 preferentially targets TG/CA motifs in nucleic acid molecules. Furthermore, there is a significant preference for targeting of the C at the CA motif to occur at MC-1 sites, resulting in C>T mutations. Accordingly, a higher than expected number or percentage of C>T mutations at CA motifs at MC-1 sites in the non-transcribed strand of a nucleic acid molecule indicates that APOBEC1 is a likely cause of targeted somatic mutagenesis of the nucleic acid, and that APOBEC1 is active in the cells and/or tissue from which the nucleic acid was obtained. There is also a preference for targeting of the G at the TG motif to occur at MC-2 sites, resulting in G>A mutations. Accordingly, a higher than expected number or percentage of G>A mutations at TG motifs at MC-2 sites in the non-transcribed strand of a nucleic acid molecule may indicate that APOBEC1 is a likely cause of targeted somatic mutagenesis of the nucleic acid, and that APOBEC1 is active in the cells and/or tissue from which the nucleic acid was obtained.

Somatic mutations at a WA motif are known to occur in phase 2 of the AID-associated SHM process in germinal center B cells, and are thus indicative of AID-associated mutation processes and, by extension, may be indicative of AID activity. As demonstrated herein, there is a preference for targeting of the A to occur at MC-2 sites, resulting in A>T mutations. Accordingly, a higher than expected number or percentage of A>T mutations at WA motifs at MC-2 sites in the non-transcribed strand of a nucleic acid molecule indicates that AID-associated somatic mutation processes are active in the cells and/or tissue from which the nucleic acid was obtained, and that AID may also be active in the cells and/or tissue. A determination that an AID-associated mutation process is a likely cause of targeted somatic mutagenesis can also be made on the basis of the number or percentage of observed G>A mutations in GYW motifs at MC-2 sites, or C>T mutations in WRC motifs at MC-1 sites, which are representative of AID activity.

Aflatoxin is associated with G>T transversions at the third position of codon 249 in TP53. It has been determined herein that there is a preference for targeting the G within a GG motif, wherein the targeted nucleotide is at a MC-3 site. Accordingly, a higher than expected number or percentage of G>T mutations at GG motifs at MC-3 sites in the non-transcribed strand of a nucleic acid molecule indicates that aflatoxin is a likely cause of targeted somatic mutagenesis of the nucleic acid. In particular examples, the aflatoxin is aflatoxin B1. In other examples, the aflatoxin is aflatoxin B2, G1, G2, M1 or M2.
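The motif, mutation type and codon site combinations set out in this section can be collected into a lookup table and applied to individual mutations, as sketched below. Several points are assumptions made for illustration: the sequence is taken to be the non-transcribed strand of a coding region starting in frame at index 0; the targeted A of the WA motif is taken to be its second nucleotide; the targeted G of the GG motif is taken to be the 3′ G; and the CC motif of APOBEC3G is omitted because the text as reproduced here does not show which C is targeted. All names are hypothetical. A single matching mutation is not in itself evidence of an agent's activity; the method relies on the number of matches exceeding the expected number.

```python
# Bases allowed by each nucleotide code used in this section.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "AT", "R": "AG", "Y": "CT"}

# (agent, motif, index of targeted base within motif, mutation type, codon site)
RULES = [
    ("AID", "GYW", 0, "G>A", 2),
    ("AID", "WRC", 2, "C>T", 1),
    ("APOBEC3G", "CG", 1, "G>A", 2),
    ("APOBEC3G", "CG", 0, "C>T", 1),
    ("APOBEC1", "CA", 0, "C>T", 1),
    ("APOBEC1", "TG", 1, "G>A", 2),
    ("AID-associated process", "WA", 1, "A>T", 2),  # assumed target index
    ("aflatoxin", "GG", 1, "G>T", 3),               # assumed target index
]


def codon_site(pos):
    """Map a 0-based in-frame position to 1, 2 or 3 (MC-1, MC-2, MC-3)."""
    return pos % 3 + 1


def motif_matches(seq, pos, motif, target_index):
    """True if the motif overlaps seq with its targeted base at pos."""
    start = pos - target_index
    if start < 0 or start + len(motif) > len(seq):
        return False
    return all(seq[start + i] in IUPAC[code] for i, code in enumerate(motif))


def matching_agents(seq, pos, alt):
    """Agents whose rule matches the mutation seq[pos] > alt."""
    mutation_type = f"{seq[pos]}>{alt}"
    site = codon_site(pos)
    return [
        agent
        for agent, motif, idx, rule_type, rule_site in RULES
        if rule_type == mutation_type
        and rule_site == site
        and motif_matches(seq, pos, motif, idx)
    ]


toy_sequence = "ATGGAGCATTGG"  # hypothetical coding sequence, in frame
print(matching_agents(toy_sequence, 6, "T"))
print(matching_agents(toy_sequence, 11, "T"))
```

As the first call illustrates, one mutation can satisfy the rules of more than one agent when motifs overlap (here WRC and CA), which is one reason counts across many mutations are needed.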

3.2 Identifying Motifs for Other Mutagenic Agents

As clearly demonstrated herein, mutagenic agents may target a nucleotide in a motif within a particular codon context. Thus, targeted somatic mutation by such agents generally results in one type of mutation (e.g. C>T, and not C>G or C>A), at one position within the codon structure (e.g. MC-1 and not MC-2 or MC-3) and at one motif (e.g. CG). By analyzing nucleic acid sequences for the particular mutation type at the motif and within a particular codon context, as described above, a more accurate indication of the activity of a mutagenic agent can be obtained than if just the incidence of mutations at the motif were to be examined.

This bias for codon context can be used to identify motifs for other mutagenic agents. By analyzing a nucleic acid molecule for the incidence of somatic mutations of a mutation type known to be associated with a mutagenic agent (e.g. G>T), and also assessing the codon context of the mutations and the nucleotides flanking the mutation, the motif for the mutagenic agent may be identified. When a particular mutation (e.g. G>T) occurs at a particular position within a codon (e.g. MC-3) more frequently than would occur at random, i.e. there is a preferred nucleotide position at which the mutation occurs, then it is likely that the mutations at this position occur as a result of targeted somatic mutation by the mutagenic agent. By analyzing the nucleotides flanking the mutation at the preferred nucleotide position (e.g. MC-3), any motif common to the mutations and thus targeted by the mutagenic agent can be identified.

This is demonstrated in Example 7 below. A G>T transversion at the third position of codon 249 in TP53 has previously been linked to aflatoxin. When nucleic acid from a whole-exome sample from a subject with hepatocellular carcinoma was analyzed for G>T mutations, it was observed that there were 9 G>T mutations at MC-3 sites, and each mutation was co-incident with another G immediately 5′ of the mutated G, suggesting that aflatoxin targets GG motifs, wherein the targeted (underlined) G is at an MC-3 site, to cause G>T mutations.

Thus, the present invention also provides methods for identifying a motif targeted by a mutagenic agent. The methods involve analyzing the sequence of a nucleic acid molecule to determine whether a mutation type associated with the mutagenic agent predominantly occurs at one position or site of a codon (e.g. MC-1, MC-2 or MC-3). If there is a co-incidence of mutation type and site, then the nucleotides flanking the mutated nucleotide are identified so as to identify a common motif that includes the mutated nucleotide. More specifically, the methods involve analyzing the sequence of a nucleic acid molecule to identify somatic mutations of a mutation type known to be associated with the mutagenic agent, determining the codon context of the mutations to identify a preferred nucleotide position at which the mutations occur at a higher than expected frequency, and identifying the nucleotides flanking the mutations at the preferred nucleotide position so as to identify a motif that is common to the mutations.
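The two steps described above (finding the preferred codon position for a mutation type, then tallying the nucleotides flanking mutations at that position) can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the Examples: the sequence and positions are invented, positions are 0-based and in frame, and the most frequent site is simply reported, so a separate comparison with the expected frequency would still be needed to establish that it is a preferred position.

```python
from collections import Counter


def preferred_site(positions):
    """Return the most frequent codon site (1, 2 or 3) and all site counts."""
    counts = Counter(pos % 3 + 1 for pos in positions)
    site, _ = counts.most_common(1)[0]
    return site, counts


def flanking_counts(seq, positions, site):
    """Tally the 5' and 3' neighbours of mutations at the given codon site."""
    five_prime, three_prime = Counter(), Counter()
    for pos in positions:
        if pos % 3 + 1 != site:
            continue
        if pos > 0:
            five_prime[seq[pos - 1]] += 1
        if pos + 1 < len(seq):
            three_prime[seq[pos + 1]] += 1
    return five_prime, three_prime


# Hypothetical data: positions of G>T mutations in an invented sequence.
toy_sequence = "AGGTGGCGGATA"
g_to_t_positions = [2, 5, 8, 1]

site, site_counts = preferred_site(g_to_t_positions)
five, three = flanking_counts(toy_sequence, g_to_t_positions, site)
print(f"most frequent site: MC-{site}", dict(site_counts))
print("5' neighbours:", dict(five), "3' neighbours:", dict(three))
```

In this toy input every mutation at the most frequent site has a G immediately 5′ of it and no consistent 3′ neighbour, which is the kind of pattern that would point to a GG motif.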

A similar process can also be applied when the mutation type associated with a mutagenic agent is not yet known. In such cases, the sequence of a nucleic acid molecule is first analyzed to identify somatic mutations, and any mutation type (e.g. G>T) that occurs at a position within a codon (e.g. MC-3) at a frequency that is higher than expected if the mutation occurred randomly (i.e. at a preferred nucleotide position) is also identified. The sequence flanking the mutation at the preferred nucleotide position is then assessed to determine whether there is a motif that is common to the mutations. If there is, this motif is likely the target of the mutagenic agent.

In other examples, known motifs of mutagenic agents can be further analyzed to determine the codon bias and preference for a mutation type. Nucleic acid sequences can be assessed as described herein, such as in Example 1, to determine the codon context and mutation type associated with mutations at the motif so as to assess whether there is a preference for a mutation type at a nucleotide position in the codon. For example, APOBEC3A, APOBEC3B, APOBEC3F and APOBEC3H are thought to target a TC motif, or a more stringent TCW motif. The sequence of one or more nucleic acid molecules can be analyzed to determine the codon context in which mutations at the motif occur, i.e. whether the C is at MC-1, MC-2 or MC-3, and what type of mutation occurs (e.g. C>A, C>T, or C>G). Once the co-incident mutation type, motif and codon context are identified, this set of criteria, or diagnostic rule, can be used to more accurately determine whether APOBEC3A, APOBEC3B, APOBEC3F or APOBEC3H (or other mutagenic agent) is the likely cause of targeted somatic mutagenesis in a nucleic acid molecule and is thus active in the cells from which the nucleic acid was obtained.

To identify motifs and/or diagnostic rules using the methods described above, the nucleic acid that is analyzed is typically nucleic acid that is known or suspected to have been in contact with the mutagenic agent or is nucleic acid that has been obtained from cells that are known or suspected to have been in contact with the mutagenic agent. For example, cells comprising the nucleic acid may be exposed in vitro to the mutagenic agent before nucleic acid is analyzed. In other examples, the nucleic acid may be obtained from tissue or cells from subjects that are known to have been exposed to the mutagenic agent. Multiple studies using multiple samples may be performed to validate the findings.

3.3 Assessing the Nucleic Acid Molecule

Any method known in the art for obtaining and assessing the sequence of a nucleic acid molecule can be used in the methods of the present invention. The nucleic acid molecule analyzed using the methods of the present invention can be any nucleic acid molecule, although it is generally DNA (including cDNA). Typically, the nucleic acid is mammalian nucleic acid, such as human nucleic acid. The nucleic acid can be obtained from any biological sample. For example, the biological sample may comprise blood, tissue or cells. In some examples, the biological sample is a biopsy. Moreover, the sample may be from any part of the body and may comprise any type of cells or tissue, such as, for example, breast, prostate, liver, colon, stomach, pancreatic, skin, thyroid, cervical, lymphoid, haematopoietic, bladder, lung, renal, rectal, ovarian, uterine, and head or neck tissue or cells, or cells from cerebrospinal fluid. In some instances, the nucleic acid is obtained from a cell or tissue sample from a subject suspected of or at risk of having cancer, or is obtained from a cell or tissue sample from a subject that has cancer.

The nucleic acid molecule can contain a part or all of one gene, or a part or all of two or more genes, and it is the sequence of this gene or genes that is analyzed according to the methods of the invention. For example, the nucleic acid molecules may comprise all or part of the TP53, PIK3CA, ERBB2, DIRAS3, TET2 or nitric oxide synthase (NOS) genes. In some instances, the nucleic acid molecule comprises the whole genome or whole exome, and it is the sequence of the whole genome or whole exome that is analyzed in the methods of the invention.

When using the methods of the present invention, the sequence of the nucleic acid molecule may have been predetermined. For example, the sequence may be stored in a database or other storage medium, and it is this sequence that is analyzed according to the methods of the invention. In other instances, the sequence of the nucleic acid molecule must be first determined prior to employment of the methods of the invention. In particular examples, the nucleic acid molecule must also be first isolated from the biological sample.

Methods for obtaining nucleic acid and/or sequencing the nucleic acid are well known in the art, and any such method can be utilized for the methods described herein. In some instances, the methods include amplification of the isolated nucleic acid prior to sequencing, and suitable nucleic acid amplification techniques are well known to a person of ordinary skill in the art. Nucleic acid sequencing techniques are well known in the art and can be applied to single or multiple genes, or whole exomes or genomes. These techniques include, for example, capillary sequencing methods that rely upon ‘Sanger sequencing’ (Sanger et al. (1977) Proc Natl Acad Sci USA 74: 5463-5467) (i.e. methods that involve chain-termination sequencing), as well as “next generation sequencing” techniques that facilitate the sequencing of thousands to millions of molecules at once. Such methods include, but are not limited to, pyrosequencing, which makes use of luciferase to read out signals as individual nucleotides are added to DNA templates; “sequencing by synthesis” technology (Illumina), which uses reversible dye-terminator techniques that add a single nucleotide to the DNA template in each cycle; and SOLiD™ sequencing (Sequencing by Oligonucleotide Ligation and Detection; Life Technologies), which sequences by preferential ligation of fixed-length oligonucleotides. These next generation sequencing techniques are particularly useful for sequencing whole exomes and genomes.

Once the sequence of the nucleic acid molecule is obtained, single point somatic mutations are then identified. Single point mutations may be identified by comparing the sequence to a control sequence. The control sequence may be the sequence of a nucleic acid molecule obtained from a sample from a control individual, such as a healthy individual that is free of disease; the sequence of a nucleic acid molecule obtained from a control sample, such as a sample from healthy, non-diseased tissue; or may be a consensus sequence understood to contain no somatic mutations. In addition to identifying the single point mutations, the codon containing the mutation and the position of the mutation within the codon (MC-1, MC-2 or MC-3) are identified. Nucleotides in the flanking 5′ and 3′ codons are also identified so as to identify the motifs. Typically, for the methods of the present invention, the sequence of the non-transcribed strand (equivalent to the cDNA sequence) of the nucleic acid molecules is analyzed. In some instances, the sequence of the transcribed strand is analyzed.
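As a non-limiting illustration of the comparison described above, the following Python sketch lists the single-base differences between a control sequence and a sample sequence and assigns each to a codon and to MC-1, MC-2 or MC-3. The function name point_mutations is hypothetical, and the sketch assumes that both inputs are aligned, equal-length, in-frame sequences of the non-transcribed strand numbered from the first nucleotide of the start codon.

```python
def point_mutations(control, sample):
    """List single-base differences between two aligned coding sequences.

    Returns tuples of (1-based position, codon number, 'MC-n', 'X>Y').
    Assumption: index 0 of both sequences is the first base of codon 1.
    """
    if len(control) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    found = []
    for i, (ref, alt) in enumerate(zip(control, sample)):
        if ref != alt:
            codon_number = i // 3 + 1        # codons are numbered from 1
            codon_site = f"MC-{i % 3 + 1}"   # position within the codon
            found.append((i + 1, codon_number, codon_site, f"{ref}>{alt}"))
    return found

# Usage: a G>A change at nucleotide 4 falls at MC-1 of codon 2.
calls = point_mutations("ATGGCATGG", "ATGACATGG")
```

Insertions and deletions are outside the scope of this sketch, which considers only single point substitutions.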

FIG. 2 shows an example of the analysis that can be performed on nucleic acid from a biological sample as described above to determine whether APOBEC3G and/or AID are a likely cause of somatic mutagenesis. In this example, the locations of single point mutations in the cDNA sequence (wherein the start codon “ATG” comprises the 1st, 2nd and 3rd nucleotides of the molecule) for sample PD3185a have been identified and their codon context determined so as to assess how many and what type of mutations at the GYW/WRC, CG/CG and WA motifs in each position occur. The data are then tabulated and statistical analyses applied to determine whether the mutations arose by chance or as a result of targeted somatic mutagenesis caused by AID and/or APOBEC3G. In the example shown in FIG. 2, because there are more G>A mutations in the GYW motif at MC-2 sites and more C>T mutations in the WRC motif at MC-1 sites on the non-transcribed strand than expected, it is likely that AID is a cause of targeted somatic mutagenesis in this nucleic acid molecule. Furthermore, because there are more G>A mutations in the CG motif at MC-2 sites on the non-transcribed strand than expected, it is likely that APOBEC3G is also a cause of targeted somatic mutagenesis in this nucleic acid molecule.

As demonstrated herein, using the methods of the present invention, only a small number of mutations at motifs need be analyzed to determine with statistical significance whether targeted somatic mutagenesis has occurred as a result of the activity of a particular mutagenic agent. In some instances, the number of mutations at a particular motif analyzed using the methods of the present invention may be as few as 2 mutations. For example, if it is found that an apparently healthy patient has only 2 somatic mutations in the analyzed nucleic acid, and both of these are G>A mutations in a GYW motif at an MC-2 site, then the probability that this pattern arose by chance is 0.04238 (p<0.05, using a Chi-square test, 9-1=8 df). Alternatively, the probability of each of the mutations occurring by chance can be said to be 1/9 (i.e. a 1/3 chance of a G>A mutation, and a 1/3 chance of the mutation being at an MC-2 site, as discussed above), and the probability that 2 out of 2 mutations occur in this pattern is therefore 1/81 (or 0.012346). However, as would be understood by those skilled in the art, statistical significance may be improved when more mutations at a particular motif are analyzed. Thus, in some instances, the number of mutations at a particular motif analyzed using the methods of the present invention may be at least 20. Many nucleic acid samples from subjects before or after treatment have 40 or more mutations, with some harboring up to 400 or more mutations. Accordingly, the number of mutations at a particular motif analyzed using the methods of the present invention may be at least or about 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300 or more.
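The two probabilities given above for the 2-mutation case can be reproduced with the following Python sketch, which is offered only as one plausible reading of the calculation. It assumes 9 equally likely categories (3 mutation types of a nucleotide by 3 codon positions), so that a goodness-of-fit test has 9-1=8 degrees of freedom, and it uses the closed-form upper-tail probability of the Chi-square distribution that holds for an even number of degrees of freedom. The function names are hypothetical.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a Chi-square statistic for even df.

    Uses the identity P(X > x) = exp(-x/2) * sum_{j<df/2} (x/2)^j / j!.
    """
    if df <= 0 or df % 2:
        raise ValueError("closed form applies to positive even df only")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))

def goodness_of_fit(observed, expected):
    """Pearson Chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

categories = 9                      # 3 mutation types x 3 codon positions
n_mutations = 2
observed = [2] + [0] * (categories - 1)   # both mutations in one category
expected = [n_mutations / categories] * categories

statistic = goodness_of_fit(observed, expected)               # about 16
p_chi_square = chi2_upper_tail_even_df(statistic, categories - 1)
p_direct = (1 / 9) ** n_mutations                             # 1/81
```

Under these assumptions the Chi-square route gives approximately 0.04238 and the direct product gives 1/81 (approximately 0.012346), matching the values quoted in the text.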

All the essential materials and reagents required for detecting targeted somatic mutagenesis in a subject and further identifying the likelihood that a mutagenic agent is the cause of the targeted somatic mutagenesis, and related methods as described herein, may be assembled together in a kit. For example, when the methods of the present invention include first isolating and/or sequencing the nucleic acid to be analyzed, kits comprising reagents to facilitate that isolation and/or sequencing are envisioned. Such reagents can include, for example, primers for amplification of DNA, polymerase, dNTPs (including labelled dNTPs), positive and negative controls, and buffers and solutions. Such kits also generally will comprise, in suitable means, distinct containers for each individual reagent. The kit can also feature various devices, and/or printed instructions for using the kit.

In some embodiments, the methods described generally herein are performed, at least in part, by a processing system, such as a suitably programmed computer system. A stand-alone computer, with the microprocessor executing applications software allowing the above-described methods to be performed, may be used. Alternatively, the methods can be performed, at least in part, by one or more processing systems operating as part of a distributed architecture. For example, a processing system can be used to identify mutation types, the codon context of a mutation and/or motifs within one or more nucleic acid sequences. In some examples, commands input to the processing system by a user assist the processing system in making these determinations.

In one example, a processing system includes at least one microprocessor, a memory, an input/output device, such as a keyboard and/or display, and an external interface, interconnected via a bus. The external interface can be utilized for connecting the processing system to peripheral devices, such as a communications network, database, or storage devices. The microprocessor can execute instructions in the form of applications software stored in the memory to allow the methods of the present invention to be performed, as well as to perform any other required processes, such as communicating with the computer systems. The applications software may include one or more software modules, and may be executed in a suitable execution environment, such as an operating system environment, or the like.

In another example, the processing system can be used to upload sequence information and other relevant data from databases or other sources. Algorithms devised to be appropriate for the methods disclosed herein can be applied to data, such as shown in FIG. 18. In this example, input data [1] and test parameters (such as motifs to be used) [2] are uploaded or entered into the system. A base substitution table is then generated for mutations within the genomic region of interest with data aligned and linked to mutations with codon context data and other information linked to sample details and nucleotide sequence [3, 4, 5]. The next step involves the identification of co-incident occurrences of each mutation type at each motif at each nucleotide position within the codons [6]. The data are tabulated to record co-incident occurrences of each mutation type at each motif with codon context [7], including the relative likelihood grades with levels of confidence for each diagnosis [8]. The results are linked to identify the mutagenic agents (or molecular structures) and the biochemical processes likely to be involved in producing the mutations and relevant clinical information [8]. An output report is generated according to the service request information used as input [9] and a readable output is generated [10].
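The tabulation step of such a workflow might be sketched in Python as follows. This is an assumption-laden illustration rather than the algorithm of FIG. 18: the record layout (dictionaries with "mutation", "motif" and "site" keys) and the function name tabulate are hypothetical, and the grading and reporting steps are not modeled.

```python
from collections import Counter

def tabulate(records):
    """Count co-incident occurrences of (mutation type, motif, codon site).

    Assumption: each record is a dict with hypothetical keys
    'mutation' (e.g. 'G>A'), 'motif' (e.g. 'GYW') and 'site' (e.g. 'MC-2').
    """
    table = Counter()
    for rec in records:
        table[(rec["mutation"], rec["motif"], rec["site"])] += 1
    return table

# Usage with three illustrative (made-up) records.
records = [
    {"mutation": "G>A", "motif": "GYW", "site": "MC-2"},
    {"mutation": "G>A", "motif": "GYW", "site": "MC-2"},
    {"mutation": "C>T", "motif": "WRC", "site": "MC-1"},
]
table = tabulate(records)
```

Keying the table on the full triple keeps mutation type, motif and codon context together, which is the co-incidence that the diagnostic rules rely upon.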

4. Diagnostic and Therapeutic Applications

The methods described herein for detecting whether targeted somatic mutation has occurred and determining the likelihood that a mutagenic agent is a cause of somatic mutagenesis of a nucleic acid molecule have many useful diagnostic and therapeutic applications. Somatic mutagenesis is known to be associated with the development and progression of many cancers. Similarly, some mutagenic agents are known to be associated with the development and progression of many cancers. Using the methods described herein, the presence and/or extent of targeted somatic mutagenesis resulting from one or more mutagenic agents, and the identity of the mutagenic agent that is the likely cause of somatic mutagenesis, can be determined. This can facilitate early diagnosis of cancer, a determination of the likelihood that a subject has or will develop cancer, and/or development of appropriate therapeutic or preventative protocols. In addition, ongoing assessment of targeted somatic mutations attributable to one or more mutagenic agents can be used to assess whether a cancer is progressing or regressing and/or the success or failure of a treatment regimen. For example, an increase in the number of targeted somatic mutations detected in nucleic acid from a sample, such as a biopsy, over time in the same subject can indicate a worsening of the cancer or a failure of a treatment regimen, while a stabilization or reduction in the number of mutations can indicate remission of the condition or success of a treatment regimen.

In particular instances, the methods of the present invention can extend to the diagnosis of cancer in a subject or a determination of the likelihood that a subject has or will develop cancer. For example, the likelihood that a subject has or will develop cancer can be assessed by analyzing a nucleic acid molecule from a biological sample from the subject so as to determine whether targeted somatic mutagenesis by one or more mutagenic agents has occurred. If targeted somatic mutagenesis has occurred, a determination can be made that the subject is likely to have or to develop cancer.

In some examples, the diagnostic rules described above are utilized to determine the likelihood that targeted somatic mutagenesis has occurred. For example, targeted somatic mutagenesis can be detected when the number or percentage of observed G>A mutations in GYW motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed C>T mutations in WRC motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed G>A mutations in CG motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed C>T mutations in CG motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed C>T mutations in CA motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed G>A mutations in GA motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed G>A mutations in TG motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed G>T mutations in GG motifs at MC-3 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed C>T mutations in CC motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; or the number or percentage of observed A>G mutations in WA motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected, as described above for AID, APOBEC3G, APOBEC3H, APOBEC1 and aflatoxin. In other examples, diagnostic rules determined for other mutagenic agents using the methods described herein are used to detect the occurrence of targeted somatic mutation.
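For illustration, the ten rules recited above can be held as simple (mutation type, motif, codon site) triples and counted against a list of observed mutations, as in the following Python sketch. The names RULES and rule_counts are hypothetical, the rules are deliberately not mapped to individual mutagenic agents here, and deciding whether a count is "higher than expected" would require a separate statistical test that is not shown.

```python
# The ten rules recited in the text, as (mutation type, motif, codon site).
RULES = [
    ("G>A", "GYW", "MC-2"),
    ("C>T", "WRC", "MC-1"),
    ("G>A", "CG", "MC-2"),
    ("C>T", "CG", "MC-1"),
    ("C>T", "CA", "MC-1"),
    ("G>A", "GA", "MC-1"),
    ("G>A", "TG", "MC-2"),
    ("G>T", "GG", "MC-3"),
    ("C>T", "CC", "MC-1"),
    ("A>G", "WA", "MC-2"),
]

def rule_counts(mutations):
    """Count observed mutations that match each rule.

    Assumption: `mutations` is an iterable of (type, motif, site) tuples
    that have already been annotated on the non-transcribed strand.
    """
    counts = {rule: 0 for rule in RULES}
    for observed in mutations:
        if observed in counts:
            counts[observed] += 1
    return counts

# Usage with illustrative (made-up) observations.
counts = rule_counts([
    ("G>A", "GYW", "MC-2"),
    ("G>A", "GYW", "MC-2"),
    ("G>T", "GG", "MC-3"),
    ("T>A", "WA", "MC-3"),   # matches no rule and is ignored
])
```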

In some instances, when targeted somatic mutations are detected in a sample containing cells or tissue from a particular region or location in a subject, such as breast, prostate, liver, colon, stomach, pancreatic, skin, thyroid, cervical, lymphoid, haematopoietic, bladder, lung, renal, rectal, ovarian, uterine, and head or neck tissue or cells, then a determination that the subject has or is likely to develop cancer involving that tissue or those cells is made. Thus, for example, a determination that the subject has or is likely to develop breast, prostate, liver, colon, stomach, pancreatic, skin, thyroid, cervical, lymphoid, haematopoietic, bladder, lung, renal, rectal, ovarian, uterine, or head and neck cancer may be made.

In particular examples, if it is observed that a mutagenic agent, such as AID or APOBEC3G, is the likely cause of targeted somatic mutagenesis of nucleic acid in prostate tissue or cells, then the subject may be diagnosed with prostate cancer or determined to be likely to have or to develop prostate cancer. Similarly, if it is observed that a mutagenic agent is the likely cause of targeted somatic mutagenesis of nucleic acid in breast tissue or cells, then the subject may be diagnosed with breast cancer or determined to be likely to have or to develop breast cancer.

The extent of targeted somatic mutagenesis by the mutagenic agent (i.e. the number of targeted somatic mutations attributable to the mutagenic agent in the nucleic acid) can also be used to assist in determining the likelihood that the subject has or will develop cancer, the cancer is progressing or regressing, and/or the treatment is working or not. Typically, the higher the number of targeted somatic mutations, the higher the likelihood that the subject has or will develop cancer. Furthermore, an increase in the number of targeted somatic mutations over time in a subject indicates a higher likelihood that the cancer is progressing and/or the treatment has failed. Conversely, a decrease in the number of targeted somatic mutations over time in a subject indicates a higher likelihood that the cancer is regressing and/or the treatment has been successful.

The methods of the present invention also extend to therapeutic or preventative protocols. In instances where a subject is determined to be likely to develop cancer, protocols designed to reduce that likelihood may be designed and applied. For example, if a subject is determined to be at risk of developing a cancer associated with a particular mutagenic agent, the subject can be advised to reduce exposure to that mutagenic agent. For example, if a subject is determined to be at risk of developing melanoma, the subject can be advised to reduce exposure to UV radiation. In instances where a subject has been diagnosed with cancer or determined to have a high likelihood of developing cancer using the methods described above, an appropriate therapeutic protocol can be designed for the subject and administered. This may include, for example, radiotherapy, surgery, chemotherapy, hormone ablation therapy, pro-apoptosis therapy and/or immunotherapy. In some examples, further diagnostic tests may be performed to confirm the diagnosis prior to therapy.

Radiotherapies include radiation and waves that induce DNA damage, for example, γ-irradiation, X-rays, UV irradiation, microwaves, electronic emissions, radioisotopes, and the like. Therapy may be achieved by irradiating the localized tumor site with the above-described forms of radiation. It is most likely that all of these factors effect a broad range of damage on DNA, on the precursors of DNA, on the replication and repair of DNA, and on the assembly and maintenance of chromosomes.

Dosage ranges for X-rays range from daily doses of 50 to 200 roentgens for prolonged periods of time (3 to 4 weeks), to single doses of 2000 to 6000 roentgens. Dosage ranges for radioisotopes vary widely, and depend on the half life of the isotope, the strength and type of radiation emitted, and the uptake by the neoplastic cells.

Non-limiting examples of radiotherapies include conformal external beam radiotherapy (50-100 Gray given as fractions over 4-8 weeks), either single shot or fractionated, high dose rate brachytherapy, permanent interstitial brachytherapy, and systemic radio-isotopes (e.g., Strontium 89). In some embodiments the radiotherapy may be administered in combination with a radiosensitizing agent. Illustrative examples of radiosensitizing agents include but are not limited to efaproxiral, etanidazole, fluosol, misonidazole, nimorazole, temoporfin and tirapazamine.

Chemotherapeutic agents may be selected from any one or more of the following categories:

(i) antiproliferative/antineoplastic drugs and combinations thereof, as used in medical oncology, such as alkylating agents (for example cis-platin, carboplatin, cyclophosphamide, nitrogen mustard, melphalan, chlorambucil, busulphan and nitrosoureas); antimetabolites (for example antifolates such as fluoropyrimidines like 5-fluorouracil and tegafur, raltitrexed, methotrexate, cytosine arabinoside and hydroxyurea); anti-tumor antibiotics (for example anthracyclines like adriamycin, bleomycin, doxorubicin, daunomycin, epirubicin, idarubicin, mitomycin-C, dactinomycin and mithramycin); antimitotic agents (for example vinca alkaloids like vincristine, vinblastine, vindesine and vinorelbine and taxoids like paclitaxel and docetaxel); and topoisomerase inhibitors (for example epipodophyllotoxins like etoposide and teniposide, amsacrine, topotecan and camptothecin);

(ii) cytostatic agents such as antioestrogens (for example tamoxifen, toremifene, raloxifene, droloxifene and idoxifene), oestrogen receptor down regulators (for example fulvestrant), antiandrogens (for example bicalutamide, flutamide, nilutamide and cyproterone acetate), LHRH antagonists or LHRH agonists (for example goserelin, leuprorelin and buserelin), progestogens (for example megestrol acetate), aromatase inhibitors (for example anastrozole, letrozole, vorozole and exemestane) and inhibitors of 5α-reductase such as finasteride;

(iii) agents which inhibit cancer cell invasion (for example metalloproteinase inhibitors like marimastat and inhibitors of urokinase plasminogen activator receptor function);

(iv) inhibitors of growth factor function, for example such inhibitors include growth factor antibodies, growth factor receptor antibodies (for example the anti-erbb2 antibody trastuzumab [Herceptin™] and the anti-erbb1 antibody cetuximab [C225]), farnesyl transferase inhibitors, MEK inhibitors, tyrosine kinase inhibitors and serine/threonine kinase inhibitors, for example other inhibitors of the epidermal growth factor family (for example other EGFR family tyrosine kinase inhibitors such as N-(3-chloro-4-fluorophenyl)-7-methoxy-6-(3-morpholinopropoxy)quinazolin-4-amine (gefitinib, AZD1839), N-(3-ethynylphenyl)-6,7-bis(2-methoxyethoxy)quinazolin-4-amine (erlotinib, OSI-774) and 6-acrylamido-N-(3-chloro-4-fluorophenyl)-7-(3-morpholinopropoxy)quinazolin-4-amine (CI 1033)), for example inhibitors of the platelet-derived growth factor family and for example inhibitors of the hepatocyte growth factor family;

(v) anti-angiogenic agents such as those which inhibit the effects of vascular endothelial growth factor, (for example the anti-vascular endothelial cell growth factor antibody bevacizumab [Avastin™], compounds such as those disclosed in International Patent Applications WO 97/22596, WO 97/30035, WO 97/32856 and WO 98/13354) and compounds that work by other mechanisms (for example linomide, inhibitors of integrin αvβ3 function and angiostatin);

(vi) vascular damaging agents such as Combretastatin A4 and compounds disclosed in International Patent Applications WO 99/02166, WO 00/40529, WO 00/41669, WO 01/92224, WO 02/04434 and WO 02/08213;

(vii) antisense therapies, for example those which are directed to the targets listed above, such as ISIS 2503, an anti-ras antisense; and

(viii) gene therapy approaches, including for example approaches to replace aberrant genes such as aberrant p53 or aberrant GDEPT (gene-directed enzyme pro-drug therapy) approaches such as those using cytosine deaminase, thymidine kinase or a bacterial nitroreductase enzyme and approaches to increase patient tolerance to chemotherapy or radiotherapy such as multi-drug resistance gene therapy.

Immunotherapy approaches include, for example, ex-vivo and in-vivo approaches to increase the immunogenicity of patient tumor cells, such as transfection with cytokines such as interleukin 2, interleukin 4 or granulocyte-macrophage colony stimulating factor, approaches to decrease T-cell anergy, approaches using transfected immune cells such as cytokine-transfected dendritic cells, approaches using cytokine-transfected tumor cell lines and approaches using anti-idiotypic antibodies. These approaches generally rely on the use of immune effector cells and molecules to target and destroy cancer cells. The immune effector may be, for example, an antibody specific for some marker on the surface of a malignant cell. The antibody alone may serve as an effector of therapy or it may recruit other cells to actually facilitate cell killing. The antibody also may be conjugated to a drug or toxin (chemotherapeutic, radionuclide, ricin A chain, cholera toxin, pertussis toxin, etc.) and serve merely as a targeting agent. Alternatively, the effector may be a lymphocyte carrying a surface molecule that interacts, either directly or indirectly, with a malignant cell target. Various effector cells include cytotoxic T cells and NK cells.

Examples of other cancer therapies include phototherapy, cryotherapy, toxin therapy or pro-apoptosis therapy. One of skill in the art would know that this list is not exhaustive of the types of treatment modalities available for cancer and other hyperplastic lesions.

In some instances, where the likely identity of the mutagenic agent causing the targeted somatic mutations is determined, therapy or preventative measures may include administration to the subject of an inhibitor of that mutagenic agent. Inhibitors can include, for example, siRNAs, miRNAs, protein antagonists (e.g. dominant negative mutants of the mutagenic agent), small molecule inhibitors, antibodies and fragments thereof. For example, siRNAs and antibodies specific for APOBEC cytidine deaminases and AID are commercially available and known to those skilled in the art. Other examples of APOBEC3G inhibitors include the small molecules described by Li et al. (ACS Chem Biol. (2012) 7(3): 506-517), many of which contain catechol moieties, which are known to be sulfhydryl reactive following oxidation to the orthoquinone. APOBEC1 inhibitors also include, but are not limited to, dominant negative mutant APOBEC1 polypeptides, such as the mul (H61K/C93S/C96S) mutant (Oka et al. (1997) J Biol Chem 272, 1456-1460).

Typically, therapeutic agents will be administered in pharmaceutical compositions together with a pharmaceutically acceptable carrier and in an effective amount to achieve their intended purpose. The dose of active compounds administered to a subject should be sufficient to achieve a beneficial response in the subject over time such as a reduction in, or relief from, the symptoms of cancer, and/or the reduction, regression or elimination of tumors or cancer cells. The quantity of the pharmaceutically active compound(s) to be administered may depend on the subject to be treated inclusive of the age, sex, weight and general health condition thereof. In this regard, precise amounts of the active compound(s) for administration will depend on the judgment of the practitioner, and those of skill in the art may readily determine suitable dosages of the therapeutic agents and suitable treatment regimens without undue experimentation.

In order that the invention may be readily understood and put into practical effect, particular preferred embodiments will now be described by way of the following non-limiting examples.

EXAMPLES

Example 1

Analysis of TP53 somatic mutations in breast cancer

The frequency and context of somatic mutations in the TP53 gene in breast cancers was assessed by accessing the IARC TP53 database and extracting data specific for breast cancer. The number of point mutations in this dataset was large (N=2,514). Most of the mutations were single point mutations, predominantly focused in the DNA binding region (codons ˜130-300) of TP53. Only a minor fraction of the samples carried an exonic mutation in TP53. It was assumed that there are only slight variations due to base composition of TP53, and no corrections were made. Selection of various criteria facilitated construction and analysis of all types of mutations with 5′ and 3′ flanking sequence context in relation to the unmutated TP53 exon sequence (and in some cases intronic sequence). This facilitated the development of frequency distributions of various types of mutation (e.g., A-to-G) versus nucleotide and codon position across regions of interest.

The sequences of the cDNA transcripts (i.e. the same sequence context as the non-transcribed strands) were analyzed. cDNA transcripts were used as these are publicly available in the COSMIC and Ensembl databases for extraction and analysis purposes. Using these transcripts, the sequence context around each mutation was analyzed for mutations at the AID motif (GYW/WRC), APOBEC1 motif (TG/CA) and APOBEC3G motif (CG/CG), as well as the WA motif, which is representative of potential sites for mutations at A:T base pairs in phase II of the SHM process (and thus associated with AID activity). The mutations were assessed in relation to their positions in a mutated codon.

FIG. 1 shows an example of a mutated sequence in the defined ‘region of interest’ for this analysis. The region of interest includes 9 nucleotides encompassing the mutated codon, the flanking 5-prime (5′) codon and the 3-prime (3′) codon. The respective positions of the nucleotides in the mutated codon (MC) sequence are annotated as MC-1, MC-2 and MC-3 (read 5′ to 3′). The respective positions of the nucleotides (N) in the flanking 5′ codon are annotated as 5′N1, 5′N2 and 5′N3 respectively (also read 5′ to 3′). Similarly, the positions of the nucleotides in the flanking 3′ codon are annotated as 3′N1, 3′N2 and 3′N3 respectively. In the example shown for an A-to-C point mutation (A>C), an A at an MC-1 site on the non-transcribed strand (NTS) is mutated to a C in the replicated non-transcribed strand (NTS′). The mutation of A in the mutated codon is associated with a G in the 5′N3 position. This is annotated as “S ⋅⋅ A” (where S is a G or C). This annotation is used regardless of the location of a mutation within the mutated codon.
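The 9-nucleotide region of interest described above can be extracted from a coding sequence with a few lines of Python. The following is a sketch under stated assumptions: the function name region_of_interest and the dictionary keys are hypothetical, the input is an in-frame cDNA sequence numbered from the first nucleotide of the start codon, and the toy sequence in the usage lines is invented for illustration and is not taken from TP53.

```python
def region_of_interest(cdna, position):
    """Return the 5' codon, mutated codon and 3' codon around a mutation.

    `position` is the 1-based cDNA position of the mutated nucleotide.
    Assumption: `cdna` is in frame, so index 0 is the first base of codon 1.
    """
    index = position - 1
    codon_start = index - index % 3          # 0-based start of mutated codon
    if codon_start < 3 or codon_start + 6 > len(cdna):
        raise ValueError("a full flanking codon is needed on each side")
    return {
        "five_prime": cdna[codon_start - 3:codon_start],       # 5'N1..5'N3
        "mutated": cdna[codon_start:codon_start + 3],          # MC-1..MC-3
        "three_prime": cdna[codon_start + 3:codon_start + 6],  # 3'N1..3'N3
        "site": f"MC-{index % 3 + 1}",
    }

# Usage: an A at MC-1 preceded by a G at 5'N3, as in the example above.
roi = region_of_interest("ATGGCGACCTTT", 7)
five_prime_n3 = roi["five_prime"][-1]
```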

The codon context and frequency of each of the 2,514 somatic mutations in the TP53 gene from the pooled breast cancer dataset is shown in Table 1. As noted above, MC-1, MC-2 and MC-3 refer to the position of the mutations within the mutated codon (MC). These are read 5′ to 3′ from the non-transcribed strand. To determine whether codon context was important for each mutation type, a Chi square test was used to test statistical significance against a cut-off at the P<0.01 level (2 DF).

TABLE 1 Location of all mutations within mutated codon

Mutation      MC-1 (p, 2df)       MC-2 (p, 2df)       MC-3 (p, 2df)       Total
A > T         30                  29                  0 (p < 0.001)       59
A > C         11                  30 (p < 0.001)      2 (p < 0.01)        43
A > G         64                  194 (p < 0.001)     11 (p < 0.001)      269
Total off A   105                 253                 13                  371
T > A         30                  29                  8                   67
T > C         48                  64 (p < 0.01)       18 (p < 0.001)      130
T > G         23                  44                  19                  86
Total off T   101                 137                 45                  283
C > A         23                  22                  23                  68
C > T         397 (p < 0.001)     118 (p < 0.001)     78 (p < 0.001)      593
C > G         42                  25                  22                  89
Total off C   462                 165                 123                 750
G > A         203 (p < 0.001)     505 (p < 0.001)     87 (p < 0.001)      795
G > T         69                  87                  37 (p < 0.01)       193
G > C         35                  67 (p < 0.001)      20 (p < 0.01)       122
Total off G   307                 659                 144                 1110
Total         975                 1214                325                 2514
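One way to examine codon bias for a single mutation type in Table 1 is a Chi-square goodness-of-fit test across the three codon sites, sketched below in Python. This is an assumption rather than a reproduction of the analysis behind the per-cell p-values in the table: it tests each row as a whole against an even spread over MC-1, MC-2 and MC-3, and it relies on the fact that, for 2 degrees of freedom, the upper-tail probability of a Chi-square statistic x is exactly exp(-x/2). The function name codon_bias is hypothetical.

```python
import math

# Selected rows of Table 1: counts at (MC-1, MC-2, MC-3).
TABLE_1_ROWS = {
    "C>T": (397, 118, 78),
    "G>A": (203, 505, 87),
    "A>G": (64, 194, 11),
    "C>A": (23, 22, 23),
}

def codon_bias(counts):
    """Chi-square statistic and p-value (2 df) against an even spread.

    Assumption: under the null hypothesis a mutation type is equally
    likely to fall at each of the three codon sites.
    """
    expected = sum(counts) / 3
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    p_value = math.exp(-statistic / 2)   # exact upper tail for 2 df
    return statistic, p_value

results = {mutation: codon_bias(row) for mutation, row in TABLE_1_ROWS.items()}
```

Under this reading, the three transitions C>T, G>A and A>G fall well below the P<0.01 cut-off, whereas the evenly distributed C>A row does not.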

It was observed that there were far more transitions (i.e. A<>G or C<>T) than transversions (i.e. A or G<>C or T). As a result, the mutation pattern shows significant strand biases where mutations of A exceed mutations of T (371/283=1.3), and mutations of G exceed mutations of C (1110/750=1.5). This is in agreement with previous work showing similar strand bias patterns for SHM processes in VDJ regions of Ig genes, as well as protein kinase gene mutation data across the whole genome for a range of non-lymphoid cancers that include breast cancer (Steele and Lindley (2010) DNA Repair 9: 600-603). The strand bias pattern is also in agreement with mutation data taken from B-cell chronic lymphocytic leukaemia patients (Malcikova et al. (2008) Molecular Immunology 45: 1525-9).
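The transition/transversion split and the two strand bias ratios quoted above can be recomputed from the row totals of Table 1. The following is a small sketch; the names `is_transition` and `total_off` are illustrative only.

```python
PURINES = {"A", "G"}


def is_transition(ref, alt):
    """A transition stays within purines (A<>G) or within pyrimidines (C<>T)."""
    return (ref in PURINES) == (alt in PURINES)


# Row totals copied from the 'Total' column of Table 1.
TABLE1_TOTALS = {
    ("A", "T"): 59, ("A", "C"): 43, ("A", "G"): 269,
    ("T", "A"): 67, ("T", "C"): 130, ("T", "G"): 86,
    ("C", "A"): 68, ("C", "T"): 593, ("C", "G"): 89,
    ("G", "A"): 795, ("G", "T"): 193, ("G", "C"): 122,
}


def total_off(base):
    """Total number of mutations off a given reference nucleotide."""
    return sum(n for (ref, _), n in TABLE1_TOTALS.items() if ref == base)


transitions = sum(n for (r, a), n in TABLE1_TOTALS.items() if is_transition(r, a))
transversions = sum(TABLE1_TOTALS.values()) - transitions
a_over_t = total_off("A") / total_off("T")
g_over_c = total_off("G") / total_off("C")

print(f"transitions={transitions}, transversions={transversions}")
print(f"A/T strand bias={a_over_t:.2f}, G/C strand bias={g_over_c:.2f}")
```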

The pooled dataset shown in Table 1 also revealed significant mutation codon bias patterns not previously reported. The most significant codon context biases were for transitions C>T (P<0.001, 2DF), G>A (P<0.001, 2DF) and A>G (P<0.001, 2DF), which are known to result in the hallmark strand bias patterns associated with SHM processes.

It was found that 397/593 (66.9%) of all C>T transitions occurred at an MC-1 site, and 397/750 (52.9%) of all mutations of C (i.e. C>A/G/T) were C>T transitions at an MC-1 site. In contrast, 505/795 (63.5%) of all G>A transitions occurred at an MC-2 site, and 505/1110 (45.5%) of all mutations of G (i.e. G>A/C/T) were G>A transitions at an MC-2 site. If mutations occur randomly and independently of the codon structure, it is expected that only 1 in 9 (or around 11.1%) of mutations would occur at a particular site (i.e. MC-1, MC-2 or MC-3) for each of the 3 different types of mutation of a particular nucleotide.

For the A>G transitions, 194/269 (72.1%) of all A>G transitions occurred at an MC-2 site, and 194/371 (52.3%) of all mutations of A (i.e. A>C/G/T) were A>G transitions at an MC-2 site.
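The comparison against the 1 in 9 expectation can be made explicit with a short calculation. In this sketch the helper `cell_share` and the "fold over expected" figure are illustrative additions and are not quantities reported in the text; the counts are those quoted from Table 1.

```python
EXPECTED_SHARE = 1 / 9  # 3 mutation types x 3 codon positions, assumed equally likely


def cell_share(cell_count, total_off_base):
    """Share of all mutations off one nucleotide that fall in a single
    (mutation type, codon position) cell, plus its ratio to the 1/9 expectation."""
    share = cell_count / total_off_base
    return share, share / EXPECTED_SHARE


# (count in cell, total mutations off that nucleotide), as quoted from Table 1.
CELLS = {
    "C>T at MC-1": (397, 750),
    "G>A at MC-2": (505, 1110),
    "A>G at MC-2": (194, 371),
}

for label, (count, total) in CELLS.items():
    share, fold = cell_share(count, total)
    print(f"{label}: {share:.1%} observed vs {EXPECTED_SHARE:.1%} expected "
          f"({fold:.1f}-fold)")
```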

The data in Table 1 also support the expectation of selection for missense mutations in the TP53 gene, as the number of mutations in the MC-3 position was significantly lower than in the MC-1 or MC-2 positions for each of the transitions C>T, G>A and A>G. For RNA, the nonsense-mediated mRNA decay (NMD) pathway is one known cellular surveillance system that relies on codon context information to enable the cell to identify and dispose of defective gene products containing ‘nonsense’ mutations or STOP signals (UAG, UGA and UAA) that might prematurely stop translation. The result is selection for missense mutations in TP53. The data are also consistent with another previous study that reported a trend for higher than expected mutability for codon positions MC-1 and MC-2 in complementarity-determining regions of Ig variable (V) region genes (Shapiro et al. (2002) J. Immunology 168: 2302-2306).

The analysis also revealed a highly significant statistical preference for C>T transitions to occur at MC-1 sites (P<0.001, 2DF), and for G>A transitions to occur at MC-2 sites (P<0.001, 2DF). As cytidines on the TS or the NTS of ssDNA in an open “transcription bubble” are both able to undergo deamination, the data support the conclusion that the molecular mechanisms involved are able to read in-frame, and distinguish between cytidines on the TS and the NTS.

Table 2 shows the codon context of the 2514 somatic mutations for the TP53 breast cancer dataset occurring at AID, APOBEC1 and APOBEC3G motifs, as well as WA motifs. A Chi square test was used to determine statistical significance against a cut-off at the P<0.01 level (2 DF).

If mutations occur independently of the 5′-codon structure, and no correction is made for base composition, then it is expected that around one third of each mutation type will be located at an MC-1, MC-2 or MC-3 site. Similarly, it is expected that only around one ninth (11.1%) of all mutations of a single nucleotide will be located at an MC-1, MC-2 or MC-3 site. It was found that the codon context bias for transitions at key motifs associated with AID, APOBEC1 and APOBEC3G activity was even more statistically significant than what was found in the pooled dataset shown in Table 1.

TABLE 2

Mutation    MC-1 (p, 2df)        MC-2 (p, 2df)        MC-3 (p, 2df)        Total

GYW/WRC sites (AID)
G > A       9 (p < 0.0001)       185 (p < 0.0001)     6 (p < 0.0001)       200
G > T       3                    32 (p < 0.0001)      2                    37
G > C       0                    10                   1                    11
C > A       7                    9                    0                    16
C > T       106 (p < 0.0001)     13 (p < 0.0001)      13 (p < 0.0001)      132
C > G       3                    2                    15 (p < 0.01)        20

CG/CG sites (APOBEC3G)
G > A       46 (p < 0.0001)      358 (p < 0.0001)     3 (p < 0.0001)       407
G > T       19                   24                   2 (p < 0.01)         45
G > C       10                   43 (p < 0.0001)      0 (p < 0.0001)       53
C > A       6                    0                    7                    13
C > T       240 (p < 0.0001)     6 (p < 0.0001)       2 (p < 0.0001)       248
C > G       20 (p < 0.01)        1                    6                    27

TG/CA sites (APOBEC1)
G > A       46                   62                   47                   155
G > T       17                   38 (p < 0.01)        9                    64
G > C       7                    4                    6                    17
C > A       6                    2                    6                    14
C > T       93 (p < 0.0001)      16 (p < 0.0001)      51                   160
C > G       5                    5                    4                    14

WA sites
A > T       5                    5                    0                    10
A > C       0                    15 (p < 0.0001)      1 (p < 0.01)         16
A > G       7 (p < 0.0001)       128 (p < 0.0001)     6 (p < 0.0001)       141

For GYW motifs linked to AID activity, 185/200 (92.5%) of all G>A transitions occurred at an MC-2 site, and 185/248 (74.6%) of all mutations at GYW sites (i.e. G>A/C/T) were G>A transitions at an MC-2 site. In contrast, at WRC sites, 106/132 (80.3%) of all C>T transitions occurred at an MC-1 site, and 106/168 (63.1%) of all mutations of C (i.e. C>A/G/T) were C>T transitions at an MC-1 site.

At CG motifs linked to APOBEC3G activity, 358/407 (88.0%) of all G>A transitions occurred at an MC-2 site, and 358/505 (70.9%) of all mutations at CG sites (i.e. G>A/C/T) were G>A transitions at an MC-2 site. In contrast, at CG sites, 240/248 (96.8%) of all C>T transitions occurred at an MC-1 site, and 240/288 (83.3%) of all mutations of C (i.e. C>A/G/T) were C>T transitions at an MC-1 site.

For the TG/CA motifs linked to APOBEC1 activity, the codon context bias was not as statistically significant. At CA sites, 93/160 (58.1%) of C>T transitions occurred at an MC-1 site, and 93/188 (49.5%) of all mutations of C (i.e. C>A/G/T) were C>T transitions at an MC-1 site. Only 62/155 (40.0%) of all G>A transitions at a TG site occurred at an MC-2 site, and 62/236 (26.3%) of all mutations of G (i.e. G>A/C/T) at TG sites were G>A transitions at an MC-2 site.

Another feature of the observed codon bias patterns at the key motifs shown in Table 2 is that the majority of all mutations of G for each of the motifs for AID, APOBEC1 and APOBEC3G preferentially occur at an MC-2 site. By comparison, most of the mutations of C for each of the motifs occurred at an MC-1 target site. This implies that an in-frame sensing mechanism is involved at the level of DNA during the initiation of transcription, and that it is able to distinguish between cytidines on the NTS and those on the TS in the context of an open “transcription bubble”.

For the A>G transitions at WA sites, 128/141 (90.8%) occurred at an MC-2 site, and 128/167 (76.6%) of all mutations of A at WA sites (i.e. A>C/G/T) were A>G transitions at an MC-2 site. As an elevated level of A>G mutations at WA sites is recognized as a characteristic feature of SHM activity and diagnostic of the involvement of an RNA template intermediate, this finding supports a prediction that endogenous AID-initiated mutation processes are active in at least many of the samples in the dataset.

Table 3 shows the codon context of mutations occurring at key motifs associated with AID, APOBEC1 and APOBEC3G and co-located with a strong nucleotide (S=G/C) in the 5′N3 position. The annotation ‘S ⋅⋅ M’ (where M is the mutated nucleotide A, G, C or T) is used to indicate the presence of an ‘S’ nucleotide in the 5′N3 position flanking the mutated codon, and with the mutated nucleotide target in any one of the positions MC-1, MC-2 or MC-3. If mutations occur independently of the 5′-codon structure, and no correction is made for base composition, then it is expected that only half of the mutations at each of the motifs will be co-located with an S in the 5′N3 position.

TABLE 3

Mutation    MC-1                 MC-2                 MC-3

Mutations at GYW/WRC (AID) AND S ⋅⋅ G/C sites
G > A       2/9                  184/185 (99.5%)      4/6
G > T       2/3                  31/32                2/2
G > C       0/0                  10/10                1/1
C > A       5/7                  0/9*                 0/0
C > T       102/106 (96.2%)      0/13*                9/13
C > G       3/3                  0/2*                 8/15

Mutations at CG/CG (APOBEC3G) AND S ⋅⋅ G/C sites
G > A       46/46*               352/358 (98.3%)      2/3
G > T       19/19*               24/24                2/2
G > C       10/10*               41/43                0/0
C > A       6/6                  0/0                  7/7
C > T       239/240 (99.6%)      6/6                  2/2
C > G       20/20                1/1                  4/6

Mutations at TG/CA (APOBEC1) AND S ⋅⋅ G/C sites
G > A       0/46*                36/62 (58.1%)        34/47
G > T       0/17*                27/38                9/9
G > C       0/7*                 3/4                  5/6
C > A       4/6                  2/2                  4/6
C > T       52/93 (55.9%)        10/16                45/51
C > G       5/5                  5/5                  2/4

Mutations at WA AND S ⋅⋅ A sites
A > T       0/5*                 4/5                  0/0
A > C       0/0*                 11/15                0/1
A > G       0/8*                 121/127 (95.3%)      6/6

*It is impossible for a nucleotide in the 5′N3 position to be both ‘S’ and WA or TG for mutations at the MC-1 position. Similarly, the nucleotides in the 5′N3 position cannot be ‘S’ and WRC for mutations at the MC-2 position, and all mutations at a CG site at the MC-1 position have an ‘S’ in the 5′N3 position.

The analysis revealed an unexpectedly high linkage between S ⋅⋅ M sites and transitions at motifs associated with AID, APOBEC3G activity and at WA sites, but not APOBEC1 sites. For the GYW/WRC motifs associated with AID activity, 184/185 (99.5%) of all G>A transitions in the MC-2 position had an S present in the 5′N3 position, and 102/106 (96.2%) of all C>T transitions in the MC-1 position had an S present in the 5′N3 position. For the CG/CG motifs associated with APOBEC3G activity, 352/358 (98.3%) of all G>A transitions in the MC-2 position had an S present in the 5′N3 position, and 239/240 (99.6%) of all C-to-T transitions in the MC-1 position had an S present in the 5′N3 position. For the TG/CA motifs associated with APOBEC1 activity, the results were not statistically significant. Only 36/62 (58.1%) of G>A transitions at an MC-2 site had an S present in the 5′N3 position, and 52/93 (55.9%) of C-to-T transitions in the MC-1 position had an S present in the 5′N3 position. For WA sites, 121/127 (95.3%) of A-to-G transitions at an MC-2 site had an S present in the 5′N3 position.
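One way to gauge how far these proportions depart from the stated expectation of one half is an exact one-sided binomial tail. The text does not say which test was applied to the Table 3 data, so the binomial test here is an assumption made for illustration, and `upper_tail_half` is a hypothetical helper.

```python
from math import comb


def upper_tail_half(successes, n):
    """Exact P(X >= successes) for X ~ Binomial(n, 0.5), via integer arithmetic."""
    favourable = sum(comb(n, k) for k in range(successes, n + 1))
    return favourable / 2 ** n


# (mutations with an S at 5'N3, all mutations in that cell), from Table 3.
CELLS = {
    "AID G>A at MC-2": (184, 185),
    "AID C>T at MC-1": (102, 106),
    "APOBEC3G G>A at MC-2": (352, 358),
    "APOBEC3G C>T at MC-1": (239, 240),
    "APOBEC1 G>A at MC-2": (36, 62),
    "APOBEC1 C>T at MC-1": (52, 93),
    "WA A>G at MC-2": (121, 127),
}

for label, (with_s, total) in CELLS.items():
    p = upper_tail_half(with_s, total)
    print(f"{label}: {with_s}/{total} = {with_s / total:.1%}, one-sided p = {p:.3g}")
```

Under this assumed test the AID, APOBEC3G and WA cells give vanishingly small tail probabilities, whereas the two APOBEC1 cells do not reach conventional significance, in line with the description above.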

The data in Table 3 also reveal an unexpectedly high proportion of some of the transversions at the selected motifs being co-located with an S in the 5′N3 position. In particular, there is a higher than expected linkage between G-to-T/C mutations at GYW or CG target sites and an S ⋅⋅ G site. For all transitions and transversions of G occurring at an MC-2 target for the selected AID, APOBEC3G and WA motifs, a highly significant 778/799 (97.4%) are co-located with an S ⋅⋅ G. Similarly, 375/382 (98.2%) of all transitions and transversions of C occurring at an MC-1 target site for the selected AID and APOBEC3G motifs, are co-located with an S ⋅⋅ C.
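The pooled figures can be re-derived from the MC-2 and MC-1 columns of Table 3, as in the sketch below. Note that reproducing 778/799 requires including the A>T, A>C and A>G rows at WA sites alongside the G rows at the AID and APOBEC3G motifs; the helper name `pooled` is illustrative.

```python
def pooled(pairs):
    """Sum (co-located with S, total) pairs and return (hits, total, fraction)."""
    hits = sum(h for h, _ in pairs)
    total = sum(t for _, t in pairs)
    return hits, total, hits / total


# MC-2 column of Table 3: G rows at GYW and CG sites, then A rows at WA sites.
MC2_CELLS = [
    (184, 185), (31, 32), (10, 10),    # GYW (AID): G>A, G>T, G>C
    (352, 358), (24, 24), (41, 43),    # CG (APOBEC3G): G>A, G>T, G>C
    (4, 5), (11, 15), (121, 127),      # WA: A>T, A>C, A>G
]

# MC-1 column of Table 3: C rows at WRC and CG sites.
MC1_CELLS = [
    (5, 7), (102, 106), (3, 3),        # WRC (AID): C>A, C>T, C>G
    (6, 6), (239, 240), (20, 20),      # CG (APOBEC3G): C>A, C>T, C>G
]

for label, cells in (("MC-2", MC2_CELLS), ("MC-1", MC1_CELLS)):
    hits, total, fraction = pooled(cells)
    print(f"{label}: {hits}/{total} = {fraction:.1%}")
```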

The co-location of an S ⋅⋅ M (M = A/G/C/T) therefore appears to be an integral part of the direct contact binding and codon reading frame sensor mechanisms associated with AID and APOBEC3G deaminase activity, as well as the mutator mechanism(s) acting on WA sites.

Example 2 Development of Diagnostic Rules to Predict the Activity of AID, APOBEC1 or APOBEC3G

The codon bias patterns observed for mutation at the AID, APOBEC1, APOBEC3G and WA motifs (described above) were used to generate the following “rules” or diagnostic criteria for use in predicting whether targeted somatic mutation (TSM) of a nucleic acid molecule is occurring as a result of AID, APOBEC1 and/or APOBEC3G activity:

A higher than expected number of G>A mutations off the GYW (AID) motif at MC-2 sites is associated with AID deaminase activity on the transcribed strand.

A higher than expected number of C>T mutations off the WRC (AID) motif at MC-1 sites is associated with AID deaminase activity on the non-transcribed strand.

A higher than expected number of G>A mutations off the CG (APOBEC3G) motif at MC-2 sites is associated with APOBEC3G activity.

A higher than expected number of C>T mutations off the CG (APOBEC3G) motif at MC-1 sites is associated with APOBEC3G activity.

A higher than expected number of A>G mutations off the WA motif at MC-2 sites is an indication of AID-linked mutation processes and thus AID activity.

When applying these rules, it is assumed that the set of mutations off each nucleotide are independent of each other, and that if the mutagenic agents are not present, the distribution of mutations off each nucleotide in each of the codon sites MC-1 and MC-2 will be randomly distributed for mutations off A, G, C or T.
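The five rules above lend themselves to a small lookup structure. The following is a minimal sketch, not the implementation used to produce the figures: the names `RULES`, `motif_at` and `matching_rules` are hypothetical, and the position of the targeted base within each motif (the G of GYW, the C of WRC, the A of WA, and the C or G of CG) is an assumption, since the text as reproduced here does not mark it. Each mutation that matches a rule would then be tallied as an Observed count for that diagnostic category.

```python
# IUPAC codes used by the motifs: W = A/T, R = A/G, Y = C/T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "AT", "R": "AG", "Y": "CT"}

# (motif, assumed 0-based offset of the targeted base, mutation, MC site, agent)
RULES = [
    ("GYW", 0, "G>A", 2, "AID"),
    ("WRC", 2, "C>T", 1, "AID"),
    ("CG", 1, "G>A", 2, "APOBEC3G"),
    ("CG", 0, "C>T", 1, "APOBEC3G"),
    ("WA", 1, "A>G", 2, "AID"),
]


def motif_at(seq, idx, motif, offset):
    """True if `motif` matches seq with its targeted base aligned at index idx."""
    start = idx - offset
    if start < 0 or start + len(motif) > len(seq):
        return False
    return all(seq[start + k] in IUPAC[code] for k, code in enumerate(motif))


def matching_rules(nts_seq, idx, alt, frame_start=0):
    """Return [(motif, agent)] for every rule matched by mutating nts_seq[idx] to alt."""
    mutation = f"{nts_seq[idx]}>{alt}"
    site = (idx - frame_start) % 3 + 1          # 1, 2 or 3 for MC-1, MC-2, MC-3
    return [
        (motif, agent)
        for motif, offset, rule_mutation, rule_site, agent in RULES
        if rule_mutation == mutation
        and rule_site == site
        and motif_at(nts_seq, idx, motif, offset)
    ]


# Hypothetical in-frame sequence: codons CGT and AAA.
print(matching_rules("CGTAAA", 1, "A"))  # G>A at MC-2 inside both GYW and CG
print(matching_rules("CGTAAA", 0, "T"))  # C>T at MC-1 inside CG
```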

FIG. 2 shows an example of how the above diagnostic criteria can be used to determine the probability that the codon-bias mutation distribution arose by chance or by targeted somatic mutation by AID or APOBEC3G. For each of the above selected diagnostic categories, the number of Observed (O) and the Expected (E) mutations are tabulated in table form. For each of the diagnostic categories, the number of Expected (E) mutations is calculated using the total number of mutations likely to arise across MC-1 and MC-2 sites for each of the 3 possible types of mutation off a particular nucleotide if the mutations are random. (When analyzing mutations of the TP53 gene, as shown in FIG. 2, mutations occurring at MC-3 sites were excluded as a comparator as mutated variants of the TP53 gene have been selected for binding function. The nonsense-mediated messenger RNA decay (NMD) pathway involved is one known cellular surveillance system that relies on codon context information to enable the cell to identify and dispose of defective gene products containing nonsense mutations or STOP signals that might prematurely stop translation). For example, in regards to the WRC motif that is associated with AID activity resulting in mutations of cytosine (C) off the non-transcribed strand, if the number of mutations at C were randomly distributed, the mutations would be evenly distributed across the MC-1/MC-2 sites and across C>A, C>G and C>T (C>A/G/T). Thus, in this example the Expected (E) number of C>T mutations at an MC-1 site is the total number of observed C>A/T/G mutations at MC-1 and MC-2 sites (i.e. 1+1+72+6+1+1), divided by the number of possible types/positions of mutations (i.e. 6), which equals 13.67. A simple Chi square test is then applied to determine the probability that the observed distribution arose by chance. In the example shown in FIG. 3, the probability that the MC-1/MC-2 codon-bias distribution arose by chance for the selected set of diagnostic criteria applied to the mutation set for the TET2 gene is 7.42E-128. This result implies a very high level of significance (P<1E-127).
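The Expected (E) calculation in the WRC example can be written out as a short sketch. The six observed counts are those quoted in the text; their assignment to individual cells is not given there, which does not affect either quantity computed here. The Chi square statistic shown is for this single category only and is illustrative: how the figures combine the categories into one overall probability is not specified in the text, so no p value is computed.

```python
def expected_per_cell(observed):
    """Expected count per cell if mutations are spread evenly over all cells."""
    return sum(observed) / len(observed)


def chi_square_statistic(observed):
    """Sum of (O - E)^2 / E over all cells, with E the even-split expectation."""
    e = expected_per_cell(observed)
    return sum((o - e) ** 2 / e for o in observed)


# Observed C>A/T/G counts across MC-1 and MC-2 sites, as listed in the example.
observed_wrc = [1, 1, 72, 6, 1, 1]

print(f"E = {expected_per_cell(observed_wrc):.2f}")
print(f"chi-square statistic = {chi_square_statistic(observed_wrc):.2f}")
```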

Referring again to FIG. 3, the higher than expected number of G>A mutations off the GYW motif at MC-2 sites and higher than expected number of C>T mutations off the WRC motif at MC-1 sites indicates AID deaminase activity, while the higher than expected number of G>A mutations off the CG motif at MC-2 sites and higher than expected number of C>T mutations off the CG motif at MC-1 sites indicates APOBEC3G activity.

Example 3 Analysis of TP53 Somatic Mutations in Other Cancers

To determine whether the codon bias for mutations at the AID, APOBEC3G and WA motifs observed in TP53 in breast cancer samples also occurs in TP53 in other cancers, data was extracted from the IARC TP53 database for cervical cancer (all types), cervical adenocarcinoma, colon adenocarcinoma, hepatocellular carcinoma, pancreatic cancer, prostate cancer, and malignant melanoma and analyzed as described above.

FIGS. 4-11 show the frequency and location within codons of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G) and WA sites. As shown in these figures, the codon bias patterns for mutations at the AID, APOBEC3G and/or WA motifs in TP53 were observed in each of these cancers, indicating that there is a statistically very high likelihood that a wide range of cancers with TP53 mutations are associated with AID/APOBEC deaminase activity.

Example 4 Analysis of Somatic Mutations Attributable to AID or APOBEC3G in PIK3CA and TET2

The frequency and codon context of somatic mutations at AID, APOBEC3G and WA motifs observed in PIK3CA from breast cancer tissue samples and TET2 from haematopoietic and lymphoid tissue samples were analyzed using aggregate sample data for different patient cohorts sourced from the COSMIC database. As shown in FIGS. 12 and 13, the frequency and codon context of somatic mutations at AID, APOBEC3G and WA motifs indicated that AID and APOBEC3G were active in these tissues, and were the likely cause of a significant number of observed somatic mutations.

Example 5 Analysis of Whole Exomes from Samples from Subjects with Adenoid Cystic Carcinoma

The diagnostic criteria described above were used to assess the likelihood that AID and/or APOBEC3G were involved in targeted somatic mutagenesis in cells from adenoid cystic carcinoma (ACC) tissue of patients. Sequence data was obtained from a study in which whole exome sequencing was performed on 23 pretreatment primary ACC specimens and 1 local-regional lymph node metastasis, as well as corresponding matching normal salivary gland parenchymal samples (Stephens et al. (2013) J Clin Invest. 123(7):2965-2968). The exome sequencing identified 312 mutations, with a mean of 13 mutations per exome, which is relatively few compared to other solid tumors. The somatic mutations were analyzed as described above to determine the frequency and codon location of mutations at GYW/WRC sites (AID), CG/CG sites (APOBEC3G) and WA sites.

FIG. 14 shows representative analyses of the mutations found in two patient samples: PD3185a and PD3181a. Applying the diagnostic criteria, it was observed that targeted somatic mutation occurred in nucleic acid from the PD3185a sample, and that both AID and APOBEC3G were likely to be active in cells from this sample and the cause of the targeted somatic mutagenesis. In contrast, in the sample that had the highest number of somatic mutations (PD3181a), no evidence of targeted somatic mutation was observed, with no indication that either AID or APOBEC3G was responsible for the somatic mutations present in the nucleic acid of this sample.

Overall, it was found that only 9 out of 24 of the examined ACC samples were positive for targeted somatic mutagenesis resulting from AID and/or APOBEC3G activity (Table 4). There was no correlation between the number of mutations and targeted somatic mutagenesis, or between the MYB activation score and targeted somatic mutagenesis. This MYB activation score was derived to indicate whether or not a particular sample has fusions of the MYB-NFIB genes (Stephens et al. (2013) J Clin Invest. 123(7):2965-2968).

TABLE 4

                                                                TSM analysis
Sample ID   Histology               MYB activation   Mutations (n)   AID and/or APOBEC3G activity   Stat. Signif. (p value)
PD3178a     Cribriform              Yes              7               Negative                       NA
PD3179a     Cribriform              Yes              10              Negative                       NA
PD3180a     Solid                   Yes              13              Negative                       NA
PD3181a     Solid                   Yes              23              Negative                       NA
PD3182a     Cribriform              Yes              7               Negative                       NA
PD3184a     Solid                   Yes              5               Negative                       NA
PD3185a     Solid                   Yes              11              Positive                       0.0008
PD3186a     Cribriform with solid   Yes              13              Positive                       0.0293
PD3188a     Solid                   Yes              14              Negative                       NA
PD3189a     Solid                   No               8               Negative                       NA
PD3190a     Solid                   No               17              Positive                       0.0039
PD3191a     Solid                   Yes              11              Negative                       NA
PD3192a     Solid                   Yes              6               Negative                       NA
PD3193a     Cribriform              No               16              Negative                       NA
PD3194a     Cribriform              No               8               Negative                       NA
PD3195a     Solid                   Yes              10              Positive                       0.0132
PD3196a     Cribriform              Yes              12              Positive                       0.0032
PD3197a     Cribriform              Yes              7               Positive                       0.0039
PD3198a     Cribriform              Yes              2               Negative                       NA
PD3199a     Cribriform              Yes              8               Positive                       0.0109
PD3200a     Solid                   No               12              Positive                       0.0019
PD3208a     Solid                   Yes              7               Positive                       0.000045
PD3216a     Solid                   Yes              3               Negative                       NA
PD3226a     Solid                   Yes              7               Negative                       NA

Example 6 Analysis of Whole Exomes from Samples from Subjects with Prostate Carcinoma

Exome-wide mutation data from four prostate carcinoma samples was obtained from the COSMIC database (Wellcome Trust Sanger Institute; http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/) and analyzed as described above to determine whether the nucleic acid in the samples contained targeted somatic mutations resulting from AID and/or APOBEC3G activity. Two of the samples were from autopsied patients with metastatic castration-resistant prostate cancer (CRPC), and the other two samples were from patients with pT2c and pT3a stage prostate cancer, respectively.

As summarized in Table 5, three of the samples were found to be positive for targeted somatic mutation resulting from AID and/or APOBEC3G activity. Interestingly, targeted somatic mutation was observed in a sample from a subject with a low serum PSA level, indicating that this type of analysis could be used for the early detection of prostate carcinoma before PSA levels start to rise.

FIG. 15A and FIG. 15B show the individual analyses of the mutations found in the four patient samples. In addition to indications of AID and/or APOBEC3G activity, high numbers of G>T mutations at MC-1 sites and C>T mutations at MC-3 sites in the PR-09-3421 sample, and high numbers of G>A mutations and C>T mutations in the PR-2762 sample suggest that other APOBEC deaminases may be active in these patients.

TABLE 5

                                                                    TSM analysis
Sample ID    Stage of cancer                       Mutations (n)   AID and/or APOBEC3G activity   Stat. Signif. (p value)
WA7          Autopsy, CRPC                         41              positive                       0.000127
WA26         Autopsy, CRPC                         115             positive                       0.010397
PR-09-3421   pT3a stage, serum PSA (ng/mL) - 4.8   49              positive                       5.1E−05
PR-2762      pT3a stage, serum PSA (ng/mL) - 5.5   42              negative                       NA

Example 7 Identification of an Aflatoxin Motif

A G>T transversion at the third position of codon 249 in TP53 is linked to aflatoxin, an exogenous mutagenic agent from Aspergillus sp., and has been used as a diagnostic marker. As shown in FIG. 5, there are a very high number of G>T mutations at the MC-3 sites in combination with GG motifs in TP53 genes from hepatocellular carcinoma (HCC) samples. To investigate this further, a whole-exome sample (HCC53T) from the COSMIC database was analyzed for G>T mutations. It was observed that there were 9 G>T mutations at an MC-3 site in the whole exome, each co-incident with a GG motif. This suggests that aflatoxin causes G>T mutations at an MC-3 site off a GG motif.

Example 8 Development of Diagnostic Rules to Predict the Activity of APOBEC3H

APOBEC3H is thought to target a GA motif. To further analyze the codon context of mutations at this motif, the whole exome from tissue from a subject with bladder carcinoma (sequence obtained from the COSMIC database) was analyzed. As shown in FIG. 16, there was a predominance of G>A mutations at MC-1 sites, indicating that APOBEC3H preferentially targets mutations to the G in GA motifs when the G is at an MC-1 site, resulting in G>A mutations.

Example 9 Development of Diagnostic Rules to Predict the Activity of APOBEC3G

APOBEC3G has been suggested as targeting a CC motif in addition to the CG/CG motif. To further analyze the codon context of mutations at the CC motif, the whole exomes from tissue from 8 subjects with bladder carcinoma prior to treatment (sequences obtained from the COSMIC database) were analyzed. The sequences of the whole exomes from 8 subjects (B2, B5, B8-10, B13, B15 and B20) were analyzed as pooled data (FIG. 17A) and the sequence of the whole exome from one subject (B13) was analyzed independently (FIG. 17B). As shown in FIGS. 17A and 17B, there was a statistically significant predominance of C>T mutations at MC-1 sites, indicating that APOBEC3G preferentially targets mutations to the C in CC motifs when the targeted C is at an MC-1 site, resulting in C>T mutations.

The disclosure of every patent, patent application, and publication cited herein is hereby incorporated herein by reference in its entirety.

The citation of any reference herein should not be construed as an admission that such reference is available as “Prior Art” to the instant application.

Throughout the specification the aim has been to describe the preferred embodiments of the invention without limiting the invention to any one embodiment or specific collection of features. Those of skill in the art will therefore appreciate that, in light of the instant disclosure, various modifications and changes can be made in the particular embodiments exemplified without departing from the scope of the present invention. All such modifications and changes are intended to be included within the scope of the appended claims.

What is claimed is:
 1. A method for determining the likelihood that targeted somatic mutagenesis of a nucleic acid molecule by a mutagenic agent has occurred, the method comprising: analyzing the sequence of the nucleic acid molecule to determine, for a plurality of mutations of a mutation type at one or more motifs recognized or targeted by the mutagenic agent, the codon context of those mutations to thereby identify the location of a mutation and mutation type for each of a plurality of mutated codons in the nucleic acid molecule, wherein the codon context of an individual mutation is determined by determining at which of the three positions of a corresponding mutated codon the individual mutation occurs; and determining that targeted somatic mutagenesis is likely to have occurred when there is a higher than expected percentage or number of mutations of a mutation type at one of the three positions in the plurality of mutated codons; wherein the mutagenic agent is selected from among aflatoxin, activation-induced cytidine deaminase (AID), and an apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like (APOBEC) cytidine deaminase.
 2. The method of claim 1, wherein the expected percentage or number of mutations is calculated by assuming that mutations occur independently of codon context.
 3. The method of claim 2, wherein the expected percentage of mutations is approximately 11% or 17%.
 4. The method of claim 2, wherein the expected number of mutations is approximately 1 of every 9 mutations or 1 of every 6 mutations.
 5. The method of claim 1, wherein the percentage of mutations is observed to be at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 80%, 85%, 90%, 95% or more.
 6. The method of claim 1, wherein the APOBEC cytidine deaminase is selected from among APOBEC1, APOBEC3A, APOBEC3B, APOBEC3C, APOBEC3D, APOBEC3F, APOBEC3G and APOBEC3H.
 7. The method of claim 1, wherein targeted somatic mutagenesis is determined to be likely to have occurred when: the number or percentage of observed G>A mutations in GYW motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed C>T mutations in WRC motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed G>A mutations in CG motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed C>T mutations in CG motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed C>T mutations in CA motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed G>A mutations in GA motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed G>A mutations in TG motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed G>T mutations in GG motifs at MC-3 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; the number or percentage of observed C>T mutations in CC motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected; or the number or percentage of observed A>G mutations in WA motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected.
 8. The method of claim 7, wherein the mutagenic agent is AID if the number or percentage of observed G>A mutations in GYW motifs at MC-2 sites, A>G mutations in WA motifs at MC-2 sites, and/or C>T mutations in WRC motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected.
 9. The method of claim 7, wherein the mutagenic agent is APOBEC3G if the number or percentage of observed G>A mutations in CG motifs at MC-2 sites, C>T mutations in CC motifs at MC-1 sites or C>T mutations in CG motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected.
 10. The method of claim 7, wherein the mutagenic agent is APOBEC1 if the number or percentage of observed C>T mutations in CA motifs at MC-1 sites or G>A mutations in TG motifs at MC-2 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected.
 11. The method of claim 7, wherein the mutagenic agent is APOBEC3H if the number or percentage of observed G>A mutations in GA motifs at MC-1 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected.
 12. The method of claim 7, wherein the mutagenic agent is aflatoxin if the number or percentage of observed G>T mutations in GG motifs at MC-3 sites in the non-transcribed strand of the nucleic acid molecule is higher than expected.
 13. The method of claim 1, comprising first isolating the nucleic acid molecule.
 14. The method of claim 1, comprising sequencing all or a part of the nucleic acid molecule.
 15. The method of claim 1, wherein the nucleic acid molecule comprises all or part of a single gene or the cDNA of a single gene, or all or part of two or more genes or the cDNA of two or more genes.
 16. The method of claim 15, wherein the gene is a gene associated with cancer.
 17. The method of claim 15, wherein the gene is selected from among TP53, PIK3CA, ERBB2, DIRAS3, TET2 and nitric oxide synthase (NOS) genes.
 18. The method of claim 1, wherein nucleic acid molecules that constitute the whole exome or the whole genome of a cell are analyzed.
 19. The method of claim 1, wherein all or a part of the method is performed by a processing system.